Abstract. We present a dynamic programming-based solution to a stochastic optimal control problem up to a hitting time for a discrete-time Markov control process. First we determine an optimal control policy to steer the process toward a compact target set while simultaneously minimizing an expected discounted cost. We then provide a rolling-horizon strategy for approximating the optimal policy, together with a quantitative characterization of its suboptimality with respect to the optimal policy. Finally we address related issues of asymptotic discount-optimality of the value-iteration policy.



§1. Introduction 

Optimal control of Markov control processes (MCP) up to an exit time is a problem with a long and rich history. It has mostly been studied as the minimization of an expected undiscounted cost until the first time that the state enters a given target set, see e.g., [12, Chapter II], [20, Chapter 8], and the references therein. In particular, if a unit cost is incurred as long as the state is outside the target set, then the problem of minimizing the cost accumulated until the state enters the target is known variously as the pursuit problem [16], transient programming [34], the first passage problem [15, 25], the stochastic shortest path problem [6], and control up to an exit time [11, 12, 24]. These articles deal with at most countable state and action spaces. The problem of optimally controlling a system until an exit time from a given set has gained significance in financial and insurance mathematics, see, e.g., [10, 32].

Our interest in this problem stems from our attempts to develop a general theory of stochastic model-predictive control (MPC). In its bare essentials, deterministic MPC [26] consists of two steps: (i) solving a finite-horizon optimal control problem with constraints on the state and the controlled inputs to get an optimal policy, and (ii) applying a controller derived from the policy obtained in step (i) in a rolling-horizon fashion. The theoretical foundation of stochastic MPC is still in its infancy; see [29, 7, 33, 14, 2] and the references therein for some related work. In view of its close relationship with applications, any satisfactory theory of stochastic MPC must necessarily take into account its practical aspects. In this context an examination of a standard linear system with constrained controlled inputs affected by independent and identically distributed (i.i.d.) unbounded (e.g., Gaussian) disturbance inputs shows that no control policy can ensure that, with probability one, the state stays confined to a bounded safe set for all instants of time. This is because the noise is unbounded and the samples are independent of each other. Although disturbances are not likely to be unbounded in practice, assigning an a priori bound seems to demand considerable insight. In case a bounded-noise model is adopted, existing robust MPC techniques [4, 8] may be applied, in which the central idea is to synthesize a controller based on the bounds of the noise such that the target set becomes invariant with respect to the closed-loop dynamics. However, since the optimal policy is based on a worst-case analysis, it usually leads to rather conservative controllers and sometimes even to infeasibility. Moreover, the complexity of the optimization problem grows rapidly (typically exponentially) with the optimization horizon. An alternative is to replace the hard constraints by probabilistic (soft) ones. The idea is to find a policy that guarantees that the state constraints are satisfied with high probability over a sufficiently long time horizon. While this approach may improve feasibility aspects of the problem, it does not address the issue of what actions should be taken once the state violates the constraints. See [23, 22, 1] for recent results in this direction.

Date: September 28, 2009.

2000 Mathematics Subject Classification. Primary: 90C39, 90C40; Secondary: 93E20.

D. CHATTERJEE, E. CINQUEMANI, G. CHALOULOS, AND J. LYGEROS

In view of the above considerations, developing recovery strategies appears to be a necessary step. Such a strategy is to be activated once the state violates the constraints and to be deactivated whenever the system returns to the safe set. In general, a recovery strategy must drive the system quickly to the safe set while simultaneously meeting other performance objectives. In the context of MPC, two merits are immediate: (a) once the constraints are transgressed, appropriate actions can be taken to bring the state back to the safe set quickly and optimally, and (b) if the original problem is posed with hard constraints on the state, in view of (a) they may be relaxed to probabilistic ones to improve feasibility.

In this article we address the problem of synthesizing optimal recovery strategies. We formulate the problem as the minimization of an expected discounted cost until the state enters the safe set. An almost customary assumption in the literature concerned with stochastic optimal control up to an exit time (see, e.g., [21] and the references therein) is that the target set is absorbing. That is, there exists a control policy that makes the target set invariant with respect to the closed-loop stochastic dynamics. This is rather restrictive for MPC problems: it is invalid, for instance, in the very simple case of a linear controlled system with i.i.d. Gaussian noise inputs. We do not make this assumption, for, as mentioned above, our primary motivation for solving this problem is precisely to deal with the case that the target set is not absorbing. As a result, the dynamic programming equations involve integration over subsets of the state space and are therefore difficult to solve. At present there is no established method to solve such equations in uncountable state spaces. However, in finite state-space cases, tractable approximate dynamic programming methods [5, 28] may be employed to arrive at suboptimal but efficient policies.

This article unfolds as follows. In §2 we define the general setting of the problem, namely, Markov control processes on Polish spaces, their transition kernels and the main types of control strategies. In §3 we establish our main Theorem 3.7 under standard mild hypotheses. This result guarantees the existence of a deterministic stationary policy that leads to the minimal cost and also provides a Bellman equation that the value function must satisfy. A contraction mapping approach to the problem is pursued in §4 under the (standard) assumption that the cost-per-stage function satisfies certain growth-rate conditions. The main result (Proposition 4.6) of this section asserts both the existence and uniqueness of the optimal value function. Asymptotic discount-optimality of the value-iteration policy is investigated in §5 under two different sets of hypotheses; in particular, the results of this section show that the rolling-horizon strategy approaches optimality as the length of the horizon window increases to infinity. A rolling-horizon strategy corresponding to our optimal control problem is developed in §7; in Theorem 7.2 we establish quantitative bounds on the degree of suboptimality of the rolling-horizon strategy with respect to the optimal policy. We conclude in §9 with a discussion of future work. The state and control/action sets are assumed to be Borel subsets of Polish spaces.

§2. Preliminaries 

We employ the following standard notations. Let $\mathbb{N}$ denote the natural numbers $\{1, 2, \ldots\}$, and $\mathbb{N}_0$ denote the nonnegative integers $\{0\} \cup \mathbb{N}$. Let $\mathbf{1}_A(\cdot)$ be the standard indicator function of a set $A$, i.e., $\mathbf{1}_A(\xi) = 1$ if $\xi \in A$ and $0$ otherwise. For two real numbers $a$ and $b$, let $a \wedge b := \min\{a, b\}$.

Given a nonempty Borel set $X$ (i.e., a Borel subset of a Polish space), its Borel $\sigma$-algebra is denoted by $\mathfrak{B}(X)$. By convention "measurable" means "Borel-measurable" in the sequel. If $X$ and $Y$ are nonempty Borel spaces, a stochastic kernel on $X$ given $Y$ is a mapping $Q(\cdot|\cdot)$ such that $Q(\cdot|y)$ is a probability measure on $X$ for each fixed $y \in Y$, and $Q(B|\cdot)$ is a measurable function on $Y$ for each fixed $B \in \mathfrak{B}(X)$. We let $\mathfrak{P}(X|Y)$ be the family of all stochastic kernels on $X$ given $Y$.

We briefly recall some standard definitions. 
2.1. Definition. A Markov control model is a five-tuple
$$\bigl(X, A, \{A(x) \mid x \in X\}, Q, c\bigr) \tag{2.2}$$
consisting of a nonempty Borel space $X$ called the state space, a nonempty Borel space $A$ called the control or action set, a family $\{A(x) \mid x \in X\}$ of nonempty measurable subsets $A(x)$ of $A$, where $A(x)$ denotes the set of feasible controls or actions when the system is in state $x \in X$, and with the property that the set $\mathbb{K} := \{(x, a) \mid x \in X,\ a \in A(x)\}$ of feasible state-action pairs is a measurable subset of $X \times A$, a stochastic kernel $Q$ on $X$ given $\mathbb{K}$ called the transition law, and a measurable function $c : \mathbb{K} \to \mathbb{R}$ called the cost-per-stage function. ◊

2.3. Assumption. The set $\mathbb{K}$ of feasible state-action pairs contains the graph of a measurable function from $X$ to $A$. ◊

We let $\Pi$, $\Pi_{RM}$, $\Pi_{DM}$ and $\Pi_{DS}$ denote the sets of all randomized and history-dependent admissible policies, randomized Markov, deterministic Markov and deterministic stationary policies, respectively. For further details and notations on policies see, e.g., [19]. Consider the Markov control model (2.2), and for each $i = 0, 1, \ldots$, define the space $H_i$ of admissible histories up to time $i$ as $H_0 := X$, and $H_i := \mathbb{K}^i \times X = \mathbb{K} \times H_{i-1}$ for $i \in \mathbb{N}$. A generic element $h_i$ of $H_i$, called an admissible $i$-history, is a vector of the form $h_i = (x_0, a_0, \ldots, x_{i-1}, a_{i-1}, x_i)$, with $(x_j, a_j) \in \mathbb{K}$ for $j = 0, \ldots, i-1$ and $x_i \in X$. Hereafter we let the $\sigma$-algebra generated by the history $h_i$ be denoted by $\mathfrak{F}_i$, $i \in \mathbb{N}_0$. Let $(\Omega, \mathfrak{F})$ be the measurable space consisting of the (canonical) sample space $\Omega := (X \times A)^\infty$, and let $\mathfrak{F}$ be the corresponding product $\sigma$-algebra. Let $\pi = (\pi_i)_{i \in \mathbb{N}_0}$ be an arbitrary control policy and $\nu$ an arbitrary probability measure on $X$, referred to as the initial distribution. By a theorem of Ionescu-Tulcea [30, Chapter 3, §4, Theorem 5], there exists a unique probability measure $\mathsf{P}^\pi_\nu$ on $(\Omega, \mathfrak{F})$ supported on $H_\infty$, and such that for all $B \in \mathfrak{B}(X)$, $C \in \mathfrak{B}(A)$, and $h_i \in H_i$, $i \in \mathbb{N}_0$, $\mathsf{P}^\pi_\nu(x_0 \in B) = \nu(B)$ and
$$\mathsf{P}^\pi_\nu(a_i \in C \mid h_i) = \pi_i(C \mid h_i), \tag{2.4a}$$
$$\mathsf{P}^\pi_\nu(x_{i+1} \in B \mid h_i, a_i) = Q(B \mid x_i, a_i). \tag{2.4b}$$
The stochastic process $\bigl(\Omega, \mathfrak{F}, \mathsf{P}^\pi_\nu, (x_i)_{i \in \mathbb{N}_0}\bigr)$ is called a discrete-time Markov control process. Let $\Phi$ denote the set of stochastic kernels $\varphi$ in $\mathfrak{P}(A|X)$ such that $\varphi(A(x)|x) = 1$ for all $x \in X$, and let $\mathbb{F}$ denote the set of all measurable functions $f : X \to A$ satisfying $f(x) \in A(x)$ for all $x \in X$. The functions in $\mathbb{F}$ are called selectors of the set-valued mapping $X \ni x \longmapsto A(x) \subset A$.

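The transition kernel $Q$ in (2.4b) under a policy $\pi := (\varphi_i)_{i \in \mathbb{N}_0} \in \Pi_{RM}$ is given by $(Q(\cdot|\cdot, \varphi_i))_{i \in \mathbb{N}_0}$, defined as $\mathfrak{B}(X) \times X \ni (B, x) \longmapsto Q(B|x, \varphi_i(x)) := \int_{A(x)} \varphi_i(da|x)\, Q(B|x, a)$. Occasionally we suppress the dependence of $\varphi_i$ on $x$ and write $Q(B|x, \varphi_i)$ in place of $Q(B|x, \varphi_i(x))$. The cost-per-stage function at the $j$-th stage under a policy $(\varphi_i)_{i \in \mathbb{N}_0}$ is written as $c(x_j, \varphi_j) := \int_{A(x_j)} \varphi_j(da|x_j)\, c(x_j, a)$. We simply write $\varphi^\infty$ and $f^\infty$, respectively, for the randomized stationary policy $(\varphi, \varphi, \ldots)$ and the deterministic stationary policy $(f, f, \ldots) \in \Pi_{DS}$.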

Since we shall be exclusively concerned with Markov policies and their subclasses, in the sequel we use the notation $\Pi$ for the class of all randomized Markov strategies.

§3. Expected Discounted Cost up to the first Exit Time 

Let $K \subset X$ be a measurable set, $x_0 = x \in X$ and let $\tau := \inf\{i \in \mathbb{N}_0 \mid x_i \in K\}$.¹ We note that $\tau$ is an $(\mathfrak{F}_i)_{i \in \mathbb{N}_0}$-stopping time. Let us define
$$V(\pi, x) := \mathsf{E}^\pi_x\Biggl[\sum_{i=0}^{\tau-1} \alpha^i c(x_i, a_i)\Biggr], \qquad \alpha \in\, ]0, 1[,$$
as the $\alpha$-discounted expected cost under policy $\pi \in \Pi$ corresponding to the Markov control process $\bigl(\Omega, \mathfrak{F}, \mathsf{P}^\pi_x, (x_i)_{i \in \mathbb{N}_0}\bigr)$.² Our objective is to minimize $V(\pi, x)$ over a class of control policies $\pi$, i.e., find the $\alpha$-discount value function
$$V^*(x) := \inf_{\pi \in \Pi} V(\pi, x) = \inf_{\pi \in \Pi} \mathsf{E}^\pi_x\Biggl[\sum_{i=0}^{\tau-1} \alpha^i c(x_i, a_i)\Biggr], \qquad \alpha \in\, ]0, 1[. \tag{3.1}$$
A policy that attains the infimum above is said to be $\alpha$-discount optimal.

¹As usual the infimum over an empty set is taken to be $+\infty$.

²We employ the standard convention that a summation from a higher to a lower index is defined to be $0$.

3.2. Remark. As mentioned in the introduction, the optimization problem (3.1) with $\alpha = 1$ and the cost-per-stage function $c(x, a) = \mathbf{1}_{X \setminus K}(x)$ is known as the stochastic shortest path problem. The objective of this problem is to drive the state to a desired set ($K$ in our case) as soon as possible. For a policy $\pi$, the expected cost $V_{\mathrm{ssp}}(\pi, x)$ corresponding to the above cost-per-stage function is readily seen to be $\mathsf{E}^\pi_x[\tau]$. In this light we observe that the minimization problem in (3.1) with the cost-per-stage function $c(x, a) = \mathbf{1}_{X \setminus K}(x)$ can be viewed as a discounted stochastic shortest path problem. Since $\sum_{i=0}^{\tau-1} \alpha^i = (1 - \alpha^\tau)/(1 - \alpha)$ (with $\alpha^\tau := 0$ on $\{\tau = \infty\}$), it follows immediately that the corresponding expected cost $V_{\mathrm{dssp}}(\pi, x)$ is $\bigl(1 - \mathsf{E}^\pi_x[\alpha^\tau]\bigr)/(1 - \alpha)$. Note that the minimization of $V_{\mathrm{dssp}}(\pi, x)$ over a class of policies is always well-defined for $\alpha < 1$. Moreover, because of the monotonic behavior of the map $]0, 1[\, \ni \alpha \longmapsto \bigl(1 - \mathsf{E}^\pi_x[\alpha^\tau]\bigr)/(1 - \alpha)$, one may hope to get a good approximation of the original stochastic shortest path problem. However, pathological examples can be constructed to show that a solution to the stochastic shortest path problem may not exist. By contrast, the minimization of $V_{\mathrm{dssp}}(\pi, x)$ is always well defined, although in either case the state may never reach the desired set $K$ almost surely. ◊

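3.3. Remark. Given a cost-per-stage function $c$ on $\mathbb{K}$, one can redefine it to be $c'(x, a) := c(x, a)\mathbf{1}_{X \setminus K}(x)$ to turn the problem (3.1) into the minimization of $\mathsf{E}^\pi_x\bigl[\sum_{i=0}^{\tau-1} \alpha^i c'(x_i, a_i)\bigr]$ for $\alpha \in\, ]0, 1[$. This cost functional can be equivalently written as an infinite horizon cost functional, either as $\mathsf{E}^\pi_x\bigl[\sum_{i=0}^{\infty} \alpha^i c'(x_i, a_i)\mathbf{1}_{\{i \le \tau\}}\bigr]$ or as $\mathsf{E}^\pi_x\bigl[\sum_{i=0}^{\infty} \alpha^i c(x_i, a_i)\mathbf{1}_{\{i < \tau\}}\bigr]$. However, there is in general no policy that guarantees that $(x_i)_{i \in \mathbb{N}_0}$ stays inside $K$ for all time after $\tau$. Consequently, the problem (3.1) corresponding to the Markov control model in Definition 2.1 is not equivalent to the minimization of the infinite horizon cost functional $\mathsf{E}^\pi_x\bigl[\sum_{i=0}^{\infty} \alpha^i c'(x_i, a_i)\bigr]$, which also counts the costs incurred whenever the state leaves $K$ again after $\tau$. ◊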


3.4. A word about admissible policies. It is clear at once that the class of admissible policies for the problem (3.1) is different from the classes considered in §2. Indeed, since the process is killed at the stopping time $\tau$, it follows that the class of admissible policies should also be truncated at the stage $\tau - 1$. For a given stage $t \in \mathbb{N}_0$ we define the $t$-th policy element $\pi_t$ only on the set $\{t < \tau\}$. Note that with this definition $\pi_t$ becomes an $\mathfrak{F}_t$-measurable randomized control (in general). It is also immediate from the definition of $\tau$ that if the initial condition $x$ is inside $K$, then the set of admissible policies is empty; indeed, in this case $\tau = 0$, and no control is needed. In other words, the domain of $\pi_t$ is contained in the "spatial" region $\{(x, a) \in \mathbb{K} \mid x \in X \setminus K,\ a \in A(x)\}$; since $\pi_t$ is not defined on $K$, this is equivalent to being well-defined on $\{t < \tau\}$.

3.5. Some re-definitions. To simplify the formulas, from now on we let the cost-per-stage function be defined on $X \setminus K$. With this convention in place our problem (3.1) can be posed as the minimization of $\mathsf{E}^\pi_x\bigl[\sum_{i=0}^{\tau-1} \alpha^i c(x_i, a_i)\bigr]$ over admissible policies. Also, henceforth we redefine the set $\mathbb{K}$ of state-action pairs to be $\mathbb{K} := \{(x, a) \in X \times A \mid x \in X \setminus K,\ a \in A(x)\}$, and we note that this new set is a measurable subset of the original set of state-action pairs. Also, we let $\mathbb{F}$ be the set of selectors of the set-valued mapping $X \setminus K \ni x \longmapsto A(x) \subset A$.

Recall that a function $g : \mathbb{K} \to \mathbb{R}$ is said to be inf-compact on $\mathbb{K}$ if for every $x \in X$ and $r \in \mathbb{R}$ the set $\{a \in A(x) \mid g(x, a) \le r\}$ is compact. A transition kernel $Q$ on a measurable space $X$ given another measurable space $Y$ is said to be strongly Feller (or strongly continuous) if the mapping $y \longmapsto \int_X g(x)\, Q(dx|y)$ is continuous and bounded for every measurable and bounded function $g : X \to \mathbb{R}$. A function $g : \mathbb{K} \to \mathbb{R}$ is lower semicontinuous (l.s.c.) if for every sequence $(x_j, a_j)_{j \in \mathbb{N}} \subset \mathbb{K}$ converging to $(x, a) \in \mathbb{K}$, we have $\liminf_{j \to \infty} g(x_j, a_j) \ge g(x, a)$; or, equivalently, if for every $r \in \mathbb{R}$, the set $\{(x, a) \in \mathbb{K} \mid g(x, a) \le r\}$ is closed in $\mathbb{K}$.




3.6. Assumption. In addition to Assumption 2.3, we stipulate that

(i) the set $A(x)$ is compact for every $x \in X$,

(ii) the cost-per-stage function $c$ is lower semicontinuous, nonnegative, and inf-compact on $\mathbb{K}$, and

(iii) the transition kernel $Q$ is strongly Feller. ◊

The following is our main result on the expected discounted cost up to the first time $\tau$ to hit $K$; a proof is presented later in this section.

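3.7. Theorem. Suppose that Assumption 3.6 holds. Then

(i) The $\alpha$-discount value function $V^*$ is the (nonnegative) minimal measurable solution to the $\alpha$-discounted cost optimality equation ($\alpha$-DCOE)
$$V^*(x) = \min_{a \in A(x)} \Biggl[c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, V^*(y)\Biggr] \qquad \forall\, x \in X \setminus K. \tag{3.8}$$

(ii) There exists a selector $f_* \in \mathbb{F}$ such that $f_*(x) \in A(x)$, $x \in X \setminus K$, attains the minimum in (3.8), i.e.,
$$V^*(x) = c(x, f_*) + \alpha \int_{X \setminus K} Q(dy|x, f_*)\, V^*(y) \qquad \forall\, x \in X \setminus K, \tag{3.9}$$
and the deterministic stationary policy $f_*^\infty$ is $\alpha$-discount optimal; conversely, if $f_*^\infty$ is $\alpha$-discount optimal, then it satisfies (3.9).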



We observe that Theorem 3.7 does not assert that the optimal value function V* is unique 
in any sense. In §4 we prove a result (Proposition 4.6) under additional hypotheses that 
guarantees uniqueness of V*. 

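Since we do not assume that the cost-per-stage function $c$ is bounded, a useful approach is to consider the $\alpha$-value iteration ($\alpha$-VI) functions defined by
$$v_0(x) = 0, \qquad v_n(x) = \min_{a \in A(x)} \Biggl[c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, v_{n-1}(y)\Biggr], \quad n \in \mathbb{N},\ x \in X \setminus K. \tag{3.10}$$

Of course we have to demonstrate that $V^*(x) = \lim_{n \to \infty} v_n(x)$ for all $x \in X \setminus K$.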
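For finite state and action spaces the recursion (3.10) can be run directly. The Python sketch below does this for a small hypothetical model: four states, target set $K = \{0\}$ and two actions. The transition table `Q` and the cost function `cost` are assumptions chosen purely for illustration; none of these numbers come from the article. Probability mass that $Q$ sends into $K$ contributes nothing to the sum, which is how the restriction of the integral to $X \setminus K$ is implemented. $K$ is not absorbing in this model. Transitions out of $K$ are never needed, because the process is killed at $\tau$.

```python
# Hypothetical model: all numbers are illustrative, not taken from the article.
ALPHA = 0.9                      # discount factor alpha
STATES = [0, 1, 2, 3]
K = {0}                          # target set
ACTIONS = [0, 1]
# Q[(x, a)][y] = probability of moving from x to y under action a (only x outside K is needed).
Q = {
    (1, 0): [0.5, 0.3, 0.2, 0.0],
    (1, 1): [0.7, 0.1, 0.1, 0.1],
    (2, 0): [0.2, 0.4, 0.3, 0.1],
    (2, 1): [0.4, 0.2, 0.2, 0.2],
    (3, 0): [0.1, 0.2, 0.3, 0.4],
    (3, 1): [0.6, 0.1, 0.1, 0.2],
}
OUTSIDE = [x for x in STATES if x not in K]   # the states in X minus K


def cost(x, a):
    """Hypothetical nonnegative cost-per-stage; action 1 costs more."""
    return 1.0 + 0.5 * a


def q_value(u, x, a, alpha=ALPHA):
    """c(x, a) + alpha * sum_{y not in K} Q(y | x, a) u(y); mass sent into K contributes nothing."""
    return cost(x, a) + alpha * sum(Q[(x, a)][y] * u[y] for y in OUTSIDE)


def bellman(u, alpha=ALPHA):
    """One step of (3.10): (Tu)(x) = min_a q_value(u, x, a) for x outside K."""
    return {x: min(q_value(u, x, a, alpha) for a in ACTIONS) for x in OUTSIDE}


def value_iteration(n_iter, alpha=ALPHA):
    """Return [v_0, v_1, ..., v_n] with v_0 = 0 and v_n = T v_{n-1}."""
    v = {x: 0.0 for x in OUTSIDE}
    iterates = [v]
    for _ in range(n_iter):
        v = bellman(v, alpha)
        iterates.append(v)
    return iterates


if __name__ == "__main__":
    vs = value_iteration(200)
    print("v_1 =", vs[1])        # equals min_a c(x, a) = 1.0 at every x outside K
    print("v_200 =", vs[-1])     # numerically settled; approximates V*
```

Running it gives $v_1(x) = \min_a c(x,a) = 1$ for every $x \notin K$ and a nondecreasing sequence of iterates that settles after a few dozen steps. This is consistent with the monotone convergence $v_n \uparrow V^*$ established in Lemma 3.20.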

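The functions $v_n$, $n \in \mathbb{N}$, may be identified with the optimal cost function for the minimization of the process stopped at the $(n-1) \wedge (\tau-1)$-th step, i.e.,
$$v_n(x) = \inf_{\pi \in \Pi} \mathsf{E}^\pi_x\Biggl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, a_i)\Biggr].$$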

To get an intuitive idea, fix a deterministic Markov policy $\pi = (\pi_i)_{i \in \mathbb{N}_0}$ and take the first iterate $v_1$. From (3.10) it is immediately clear that $v_1(x) = \min_{a \in A(x)} c(x, a)$ if $x \notin K$, and that $v_1$ is not defined otherwise. For the second iterate, we have
$$v_2(x) = \inf_{\pi \in \Pi} \mathsf{E}^\pi_x\Biggl[\sum_{i=0}^{1 \wedge (\tau-1)} \alpha^i c(x_i, a_i)\Biggr] = \inf_{\pi \in \Pi} \Biggl[c\bigl(x, \pi_0(x)\bigr) + \alpha \int_X Q\bigl(d\xi_1 \big| x, \pi_0(x)\bigr)\, \mathbf{1}_{X \setminus K}(\xi_1)\, c\bigl(\xi_1, \pi_1(\xi_1)\bigr)\Biggr].$$
Note that only those sample paths that do not enter $K$ at the first step contribute to the cost at the second stage. This property is ensured by the indicator function that appears on the right-hand side of the last equality above.

3.11. Example. Let $(x_i)_{i \in \mathbb{N}_0}$ be a Markov chain with state space $X = \{1, 2, \ldots, m\}$ and transition probability matrix $Q = [q_{ij}(a)]_{m \times m}$. The argument of $q_{ij}$ expresses the dependence on the action $a$, and $A$ is a compact subset of $\mathbb{R}$. Let $K = \{1, 2, \ldots, m'\}$ for $m' < m$, fix $\alpha \in\, ]0, 1[$ and let $c(x, a) := \mathbf{1}_{X \setminus K}(x)$. Suppose further that $\inf_{a \in A} q_{ij}(a) > 0$ for all $i, j \in X$; this means, in particular, that the target set $K$ cannot be absorbing for any deterministic stationary policy. Our objective is to find an optimal policy corresponding to the minimal cost (3.9). The optimal value function $V^*$ is $0$ on $K$, and for every $i \in \{m'+1, \ldots, m\}$ we have
$$V^*(i) = \min_{a \in A(i)} \Biggl[\mathbf{1}_{X \setminus K}(i) + \alpha \int_{X \setminus K} Q(dy|i, a)\, V^*(y)\Biggr] = 1 + \alpha \min_{a \in A(i)} \sum_{j=m'+1}^{m} q_{ij}(a)\, V^*(j).$$
The most elementary case is that of $m' = m-1$. Then $V^*(m) = 1 + \alpha \min_{a \in A(m)} q_{mm}(a)\, V^*(m)$. Given a sufficiently regular function $q_{mm}(\cdot)$, this can be solved at once to get $V^*(m) = \bigl(1 - \alpha \min_{a \in A(m)} q_{mm}(a)\bigr)^{-1}$, which characterizes the function (vector) $V^*$ completely. The optimal policy in this case is $f(m) \in \arg\min_{a \in A(m)} q_{mm}(a)\, V^*(m)$. If $q_{mm}(\cdot)$ is continuous, the minimum is attained on the compact set $A$. If, in addition, $A$ is an interval and $q_{mm}(\cdot)$ is strictly convex, the minimizer is unique, which leads to a unique optimal policy. ◊
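In the elementary case $m' = m-1$ the scalar equation has the closed-form solution $V^*(m) = 1/\bigl(1 - \alpha \min_{a \in A} q_{mm}(a)\bigr)$, which is well defined because $\alpha q_{mm}(a) < 1$. The Python sketch below checks this against value iteration on a discretized action set. Several choices here are hypothetical and serve only as illustration: the self-transition probability `q_mm` (a strictly convex quadratic on $A = [0, 1]$ with minimizer $a = 0.3$), the grid `A_GRID`, and $\alpha = 0.9$.

```python
# Hypothetical data for Example 3.11 with m' = m - 1; all numbers are illustrative.
ALPHA = 0.9
A_GRID = [i / 1000 for i in range(1001)]  # discretization of the action set A = [0, 1]


def q_mm(a):
    """Hypothetical self-transition probability q_mm(a); strictly convex, values in [0.2, 0.445]."""
    return 0.2 + 0.5 * (a - 0.3) ** 2


def closed_form_value(alpha=ALPHA):
    """V*(m) = 1 / (1 - alpha * min_a q_mm(a)), from V*(m) = 1 + alpha * min_a q_mm(a) V*(m)."""
    q_min = min(q_mm(a) for a in A_GRID)
    return 1.0 / (1.0 - alpha * q_min)


def value_iteration(alpha=ALPHA, n_iter=200):
    """Scalar alpha-VI: v_n = min_a [1 + alpha * q_mm(a) * v_{n-1}], starting from v_0 = 0."""
    v = 0.0
    for _ in range(n_iter):
        v = min(1.0 + alpha * q_mm(a) * v for a in A_GRID)
    return v


def optimal_action(v, alpha=ALPHA):
    """A grid minimizer of 1 + alpha * q_mm(a) * v (unique here because v > 0)."""
    return min(A_GRID, key=lambda a: 1.0 + alpha * q_mm(a) * v)


if __name__ == "__main__":
    v = value_iteration()
    print(v, closed_form_value(), optimal_action(v))
```

Because $V^*(m) > 0$, the optimal action does not depend on the value itself: it is simply the minimizer of $q_{mm}$, here $a = 0.3$.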

Proof of Theorem 3.7. Recall from paragraph 3.5 that $c$ is defined on $X \setminus K$, $\mathbb{K} = \{(x, a) \in X \times A \mid x \in X \setminus K,\ a \in A(x)\}$, and $\mathbb{F}$ is the set of selectors of the set-valued map $X \setminus K \ni x \longmapsto A(x) \subset A$. We begin with a sequence of lemmas.

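3.12. Lemma ([19, Lemma 4.2.4]). Let the functions $u : \mathbb{K} \to \mathbb{R}$ and $u_i : \mathbb{K} \to \mathbb{R}$, $i \in \mathbb{N}$, be l.s.c., inf-compact and bounded below. If $u_i \uparrow u$, then
$$\lim_{i \to \infty} \min_{a \in A(x)} u_i(x, a) = \min_{a \in A(x)} u(x, a) \qquad \forall\, x \in X \setminus K.$$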

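3.13. Lemma (Adapted from [31]). Suppose that

• $A(x)$ is compact for each $x \in X \setminus K$ and $\mathbb{K}$ is a measurable subset of $(X \setminus K) \times A$, and

• $v : \mathbb{K} \to \mathbb{R}_{\ge 0}$ is a measurable inf-compact function, and $v(x, \cdot)$ is l.s.c. on $A(x)$ for each $x \in X \setminus K$.

Then there exists a selector $f_* \in \mathbb{F}$ such that
$$v\bigl(x, f_*(x)\bigr) = v^*(x) := \min_{a \in A(x)} v(x, a) \qquad \forall\, x \in X \setminus K,$$
and $v^*$ is a measurable function.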

3.14. Definition. Let $L_0(X \setminus K)^+$ denote the convex cone of nonnegative extended real-valued measurable functions on $X \setminus K$, and for every $u \in L_0(X \setminus K)^+$ let us define the map $Tu$ by
$$X \setminus K \ni x \longmapsto Tu(x) := \inf_{a \in A(x)} \Biggl[c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, u(y)\Biggr]. \tag{3.15}$$
The map $T$ is the dynamic programming operator corresponding to our problem (3.1). ◊



Having defined the dynamic programming operator $T$ above, it is important to identify conditions under which the function $Tu$ is measurable for $u \in L_0(X \setminus K)^+$. We have the following lemma.

3.16. Lemma. Under Assumption 3.6, the mapping $T$ in (3.15) takes $L_0(X \setminus K)^+$ into itself. Moreover, there exists a selector $f \in \mathbb{F}$ such that $Tu$ defined in (3.15) satisfies
$$Tu(x) = c(x, f) + \alpha \int_{X \setminus K} Q(dy|x, f)\, u(y) \qquad \forall\, x \in X \setminus K. \tag{3.17}$$



Proof. Fix $u \in L_0(X \setminus K)^+$. The strong Feller property of $Q$ on $\mathbb{K}$ and lower semicontinuity of the cost-per-stage function $c$ on $\mathbb{K}$ show that the map
$$\mathbb{K} \ni (x, a) \longmapsto T'u(x, a) := c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, u(y)$$
is lower semicontinuous. Since $u$ need not be bounded, one applies the strong Feller property to the bounded measurable functions $(u \wedge n)\mathbf{1}_{X \setminus K}$, $n \in \mathbb{N}$, and uses the fact that a nondecreasing limit of continuous functions is l.s.c. From nonnegativity of $u$ it follows that for every $x \in X \setminus K$ and $r \in \mathbb{R}$,
$$K' := \{a \in A(x) \mid T'u(x, a) \le r\} \subset \{a \in A(x) \mid c(x, a) \le r\}, \tag{3.18}$$
and the set $\{a \in A(x) \mid c(x, a) \le r\}$ is compact by inf-compactness of $c$. By definition, $Tu(x) = \inf_{a \in A(x)} T'u(x, a)$. Hence Lemma 3.13 yields a selector $f \in \mathbb{F}$ such that $Tu(x) = T'u(x, f(x))$ for all $x \in X \setminus K$, once we verify the hypotheses of that lemma. For this we only have to verify that $T'u$ is l.s.c. (which implies it is measurable) and inf-compact on $\mathbb{K}$. We have seen above that $T'u$ is a l.s.c. function on $\mathbb{K}$. Therefore, for each $x \in X \setminus K$ the map $T'u(x, \cdot)$ is also l.s.c. on $A(x)$. Thus, by definition of lower semicontinuity, the set $K'$ in (3.18) is closed for every $x \in X \setminus K$ and $r \in \mathbb{R}$. Since a closed subset of a compact set is compact, it follows that $K'$ is compact, which in turn shows inf-compactness of $T'u$ on $\mathbb{K}$ and proves the assertion. □
The following lemma shows how functions $u \in L_0(X \setminus K)^+$ satisfying $u \ge Tu$ relate to the optimal value function.

3.19. Lemma. Suppose that Assumption 3.6 holds. If $u \in L_0(X \setminus K)^+$ is such that $u \ge Tu$, then $u \ge V^*$.

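Proof. Suppose $u \in L_0(X \setminus K)^+$ satisfies $u \ge Tu$, and let $f$ be a selector (whose existence is guaranteed by Lemma 3.16) that attains the infimum in (3.15). Fix $x \in X \setminus K$. We have
$$u(x) \ge Tu(x) = c(x, f) + \alpha \int_X Q(d\xi_1|x, f)\, \mathbf{1}_{X \setminus K}(\xi_1)\, u(\xi_1).$$
The operator $T$ in (3.15) is monotone: if $u, u' \in L_0(X \setminus K)^+$ are two functions with $u \ge u'$, then clearly $Tu \ge Tu'$, since each $Q(\cdot|x, a)$ is a nonnegative measure. Therefore, iterating the above inequality a second time (i.e., bounding $u(\xi_1)$ from below by $Tu(\xi_1)$ inside the integral) we obtain
$$u(x) \ge c(x, f) + \alpha \int_X Q(d\xi_1|x, f)\, \mathbf{1}_{X \setminus K}(\xi_1) \Biggl[c(\xi_1, f) + \alpha \int_X Q(d\xi_2|\xi_1, f)\, \mathbf{1}_{X \setminus K}(\xi_2)\, u(\xi_2)\Biggr].$$
After $n$ such iterations we arrive at
$$u(x) \ge \mathsf{E}^{f^\infty}_x\Biggl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, f)\Biggr] + \mathsf{E}^{f^\infty}_x\bigl[\alpha^n u(x_n)\mathbf{1}_{\{n < \tau\}}\bigr].$$
Since $u \ge 0$, letting $n \to \infty$ (monotone convergence) we get
$$u(x) \ge V(f^\infty, x) \ge V^*(x).$$
Since $x \in X \setminus K$ is arbitrary, the assertion follows. □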



The next lemma deals with convergence of the value iterations to the optimal value function.

3.20. Lemma. Suppose that Assumption 3.6 holds. Then $v_n \uparrow V^*$ on $X \setminus K$, and the function $V^*$ satisfies the $\alpha$-DCOE (3.8).

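Proof. Since $v_n(x) = \inf_{\pi \in \Pi} \mathsf{E}^\pi_x\bigl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, a_i)\bigr]$ for $x \in X \setminus K$, it follows that for every $\pi \in \Pi$,
$$v_n(x) \le \mathsf{E}^\pi_x\Biggl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, a_i)\Biggr] \le \mathsf{E}^\pi_x\Biggl[\sum_{i=0}^{\tau-1} \alpha^i c(x_i, a_i)\Biggr],$$
and therefore, taking the infimum over all policies $\pi \in \Pi$ on the right-hand side, we get
$$v_n(x) \le V^*(x) \qquad \forall\, x \in X \setminus K. \tag{3.21}$$
The operator $T$ is monotone (each $Q(\cdot|x, a)$ is a nonnegative measure), and since the cost-per-stage function is nonnegative, $v_1 = Tv_0 \ge 0 = v_0$. Therefore, since $v_n = Tv_{n-1}$ for $n \in \mathbb{N}$, induction shows that the $\alpha$-VI functions form a nondecreasing sequence in $L_0(X \setminus K)^+$. This implies that $v_n \uparrow v^*$ for some function $v^* \in L_0(X \setminus K)^+$. For $n \in \mathbb{N}$ we define
$$\mathbb{K} \ni (x, a) \longmapsto T'v_n(x, a) := c(x, a) + \alpha \int_X Q(dy|x, a)\, \mathbf{1}_{X \setminus K}(y)\, v_n(y),$$
$$\mathbb{K} \ni (x, a) \longmapsto T'v^*(x, a) := c(x, a) + \alpha \int_X Q(dy|x, a)\, \mathbf{1}_{X \setminus K}(y)\, v^*(y).$$
The monotone convergence theorem guarantees that $T'v_n \uparrow T'v^*$ pointwise on $\mathbb{K}$. As in the proof of Lemma 3.16 one can establish inf-compactness and lower semicontinuity of $T'v_n$ and $T'v^*$ on $\mathbb{K}$. From Lemma 3.12 it now follows that for every $x \in X \setminus K$ we have
$$v^*(x) = \lim_{n \to \infty} v_n(x) = \lim_{n \to \infty} Tv_{n-1}(x) = \lim_{n \to \infty} \min_{a \in A(x)} T'v_{n-1}(x, a) = \min_{a \in A(x)} T'v^*(x, a) = Tv^*(x).$$
This shows that $v^*$ satisfies the $\alpha$-DCOE, $v^* = Tv^*$.

It remains to show that $v^* = V^*$. By Lemma 3.19, $v^* = Tv^*$ implies that $v^* \ge V^*$. The reverse inequality follows from (3.21) by taking limits: $v^* = \lim_{n \to \infty} v_n \le V^*$. □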

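3.22. Lemma. For every deterministic stationary policy $f^\infty$ we have
$$V(f^\infty, x) = c(x, f) + \alpha \int_X Q(dy|x, f)\, \mathbf{1}_{X \setminus K}(y)\, V(f^\infty, y) \qquad \forall\, x \in X \setminus K. \tag{3.23}$$

Proof. Fix a deterministic stationary policy $f^\infty$ and $x \in X \setminus K$. The $\alpha$-discounted cost $V(f^\infty, x)$ corresponding to this policy satisfies, in view of the definition of $\tau$ and the fact that $x \in X \setminus K$ (so that $\tau \ge 1$),
$$\begin{aligned} V(f^\infty, x) &= \mathsf{E}^{f^\infty}_x\Biggl[\sum_{i=0}^{\tau-1} \alpha^i c(x_i, f)\Biggr] = \mathsf{E}^{f^\infty}_x\Biggl[c(x, f) + \sum_{i=1}^{\tau-1} \alpha^i c(x_i, f)\Biggr] \\ &= c(x, f) + \alpha\, \mathsf{E}^{f^\infty}_x\Biggl[\sum_{i=1}^{\tau-1} \alpha^{i-1} c(x_i, f)\Biggr]. \end{aligned} \tag{3.24}$$
But then by the Markov property,
$$\mathsf{E}^{f^\infty}_x\Biggl[\sum_{i=1}^{\tau-1} \alpha^{i-1} c(x_i, f)\Biggr] = \int_X \mathbf{1}_{X \setminus K}(y)\, Q(dy|x, f)\, \mathsf{E}^{f^\infty}_x\Biggl[\sum_{i=1}^{\tau-1} \alpha^{i-1} c(x_i, f) \Bigg| x_1 = y\Biggr] = \int_X \mathbf{1}_{X \setminus K}(y)\, Q(dy|x, f)\, V(f^\infty, y).$$
This substituted back in (3.24) gives (3.23). □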
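The identity (3.23) says that the cost of a fixed selector $f$ is the fixed point of a linear equation. The Python sketch below illustrates this on a hypothetical four-state model with $K = \{0\}$; all numbers, and the selector `f`, are illustrative assumptions. It computes the fixed point by iterating the right-hand side of (3.23) and compares the result with a Monte Carlo estimate of $\mathsf{E}^{f^\infty}_x\bigl[\sum_{i=0}^{\tau-1}\alpha^i c(x_i, f)\bigr]$. That estimate is obtained by simulating paths until they hit $K$. The simulation truncates each path after `max_steps` transitions; the resulting error is negligible because the discount factor raised to that power is tiny.

```python
import random

# Hypothetical model: all numbers are illustrative, not taken from the article.
ALPHA = 0.9
STATES = [0, 1, 2, 3]
K = {0}
ACTIONS = [0, 1]
# Q[(x, a)][y] = probability of moving from x to y under action a.
Q = {
    (1, 0): [0.5, 0.3, 0.2, 0.0],
    (1, 1): [0.7, 0.1, 0.1, 0.1],
    (2, 0): [0.2, 0.4, 0.3, 0.1],
    (2, 1): [0.4, 0.2, 0.2, 0.2],
    (3, 0): [0.1, 0.2, 0.3, 0.4],
    (3, 1): [0.6, 0.1, 0.1, 0.2],
}
OUTSIDE = [x for x in STATES if x not in K]


def cost(x, a):
    """Hypothetical cost-per-stage: action 1 costs more."""
    return 1.0 + 0.5 * a


def evaluate_policy(f, alpha=ALPHA, n_iter=2000):
    """Fixed point of (3.23): V(x) = c(x, f(x)) + alpha * sum_{y not in K} Q(y | x, f(x)) V(y)."""
    v = {x: 0.0 for x in OUTSIDE}
    for _ in range(n_iter):
        v = {x: cost(x, f[x]) + alpha * sum(Q[(x, f[x])][y] * v[y] for y in OUTSIDE)
             for x in OUTSIDE}
    return v


def simulate_cost(f, x0, rng, alpha=ALPHA, max_steps=1000):
    """One sample of sum_{i=0}^{tau-1} alpha^i c(x_i, f(x_i)), stopped when the path enters K."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(max_steps):
        if x in K:  # tau has been reached: the process is killed
            break
        a = f[x]
        total += discount * cost(x, a)
        discount *= alpha
        x = rng.choices(STATES, weights=Q[(x, a)])[0]
    return total


def monte_carlo(f, x0, n_paths=20000, seed=0, alpha=ALPHA):
    """Sample mean of simulate_cost over n_paths independent paths started at x0."""
    rng = random.Random(seed)
    return sum(simulate_cost(f, x0, rng, alpha) for _ in range(n_paths)) / n_paths


if __name__ == "__main__":
    f = {1: 0, 2: 0, 3: 1}  # a hypothetical selector on the states outside K
    exact = evaluate_policy(f)
    for x in OUTSIDE:
        print(x, round(exact[x], 4), round(monte_carlo(f, x), 4))
```

The two columns should agree up to Monte Carlo error (a few hundredths with the default number of paths).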



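Proof of Theorem 3.7. (i) That $V^*$ is a solution of the $\alpha$-DCOE follows from Lemma 3.20. That $V^*$ is the minimal solution follows from Lemma 3.19, since $u = Tu$ implies $u \ge V^*$.

(ii) Lemma 3.16 guarantees the existence of a selector $f_* \in \mathbb{F}$ such that (3.9) holds. Fix $n \in \mathbb{N}$ and $x \in X \setminus K$. As in the proof of Lemma 3.19, iterating equation (3.9) $n$ times we arrive at
$$V^*(x) = \mathsf{E}^{f_*^\infty}_x\Biggl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, f_*)\Biggr] + \mathsf{E}^{f_*^\infty}_x\bigl[\alpha^n V^*(x_n)\mathbf{1}_{\{n < \tau\}}\bigr] \ge \mathsf{E}^{f_*^\infty}_x\Biggl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, f_*)\Biggr].$$
By the monotone convergence theorem we have
$$V^*(x) \ge \lim_{n \to \infty} \mathsf{E}^{f_*^\infty}_x\Biggl[\sum_{i=0}^{(n-1) \wedge (\tau-1)} \alpha^i c(x_i, f_*)\Biggr] = \mathsf{E}^{f_*^\infty}_x\Biggl[\sum_{i=0}^{\tau-1} \alpha^i c(x_i, f_*)\Biggr],$$
which shows that $V^*(x) \ge V(f_*^\infty, x)$. Since $x \in X \setminus K$ is arbitrary, it follows that $V^*(\cdot) \ge V(f_*^\infty, \cdot)$. The reverse inequality follows from the definition of $V^*$ in (3.1). We conclude that $V^*(\cdot) = V(f_*^\infty, \cdot)$, and that $f_*^\infty$ is an optimal policy.

For the converse, if $f_*^\infty$ is an optimal deterministic stationary policy, then by Lemma 3.22, equation (3.23) becomes
$$V^*(x) = V(f_*^\infty, x) = c(x, f_*) + \alpha \int_X Q(dy|x, f_*)\, \mathbf{1}_{X \setminus K}(y)\, V(f_*^\infty, y)$$
for $x \in X \setminus K$. Since $V(f_*^\infty, \cdot) = V^*(\cdot)$, this is identical to (3.9). □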
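Theorem 3.7(ii) can be checked by brute force on a finite model. The greedy selector $f_*$ extracted from $V^*$ should reproduce $V^*$ through (3.23), and no other deterministic stationary policy should do better at any state. The Python sketch below does this on a hypothetical four-state model with $K = \{0\}$; the numbers are illustrative only. It approximates $V^*$ by a long run of (3.10) and enumerates all $|A|^{|X \setminus K|} = 8$ selectors. Ties in the minimization would be broken toward the first action, which is an arbitrary choice.

```python
from itertools import product

# Hypothetical model: all numbers are illustrative, not taken from the article.
ALPHA = 0.9
STATES = [0, 1, 2, 3]
K = {0}
ACTIONS = [0, 1]
Q = {
    (1, 0): [0.5, 0.3, 0.2, 0.0],
    (1, 1): [0.7, 0.1, 0.1, 0.1],
    (2, 0): [0.2, 0.4, 0.3, 0.1],
    (2, 1): [0.4, 0.2, 0.2, 0.2],
    (3, 0): [0.1, 0.2, 0.3, 0.4],
    (3, 1): [0.6, 0.1, 0.1, 0.2],
}
OUTSIDE = [x for x in STATES if x not in K]


def cost(x, a):
    """Hypothetical cost-per-stage: action 1 costs more."""
    return 1.0 + 0.5 * a


def q_value(u, x, a, alpha=ALPHA):
    """c(x, a) + alpha * sum over y outside K of Q(y | x, a) u(y)."""
    return cost(x, a) + alpha * sum(Q[(x, a)][y] * u[y] for y in OUTSIDE)


def optimal_value(alpha=ALPHA, n_iter=2000):
    """Approximate V* by n_iter steps of (3.10) started from v_0 = 0."""
    v = {x: 0.0 for x in OUTSIDE}
    for _ in range(n_iter):
        v = {x: min(q_value(v, x, a, alpha) for a in ACTIONS) for x in OUTSIDE}
    return v


def greedy_selector(v, alpha=ALPHA):
    """A selector attaining the minimum in (3.8) for the given v (ties go to the first action)."""
    return {x: min(ACTIONS, key=lambda a: q_value(v, x, a, alpha)) for x in OUTSIDE}


def evaluate_policy(f, alpha=ALPHA, n_iter=2000):
    """V(f^infinity, .) computed as the fixed point of (3.23)."""
    v = {x: 0.0 for x in OUTSIDE}
    for _ in range(n_iter):
        v = {x: q_value(v, x, f[x], alpha) for x in OUTSIDE}
    return v


def all_selectors():
    """Enumerate every deterministic stationary policy on the states outside K."""
    for choice in product(ACTIONS, repeat=len(OUTSIDE)):
        yield dict(zip(OUTSIDE, choice))


if __name__ == "__main__":
    v_star = optimal_value()
    f_star = greedy_selector(v_star)
    print("f_* =", f_star)
    for f in all_selectors():
        v_f = evaluate_policy(f)
        print(f, {x: round(v_f[x], 4) for x in OUTSIDE})
```

In this toy model the greedy selector uses the cheaper action 0 at states 1 and 2 and the costlier action 1 at state 3. It is the only one of the eight selectors whose cost matches $V^*$ at every state.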
§4. A Contraction Mapping Approach

For the purposes of this section we let $L_0(X \setminus K)$ denote the real vector space of real-valued measurable functions on $X \setminus K$, and $L_0(X \setminus K)^+$ be the convex cone of nonnegative elements of $L_0(X \setminus K)$. (Note that according to paragraph 3.5 we let the elements of $L_0(X \setminus K)^+$ take the value $+\infty$.) Given a measurable weight function $w : X \setminus K \to [1, \infty[$ in $L_0(X \setminus K)^+$, we define the weighted norm $\|u\|_w := \sup_{x \in X \setminus K} |u(x)|/w(x)$. It is well known that the functions $u \in L_0(X \setminus K)$ with $\|u\|_w < \infty$ form a Banach space under $\|\cdot\|_w$; by a slight abuse of notation we denote it by $(L_0(X \setminus K), \|\cdot\|_w)$.

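4.1. Assumption. In addition to Assumption 3.6, we require that there exist $\bar c > 0$, $\beta \in [1, 1/\alpha[$, and a measurable weight function $w : X \setminus K \to [1, \infty[$ such that for every $x \in X \setminus K$

(i) $\sup_{a \in A(x)} c(x, a) \le \bar c\, w(x)$;

(ii) $\sup_{a \in A(x)} \int_{X \setminus K} Q(dy|x, a)\, w(y) \le \beta w(x)$.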



4.2. Remark. If $c$ is bounded, the weight function $w$ may be taken to be $\mathbf{1}_{X \setminus K}$, i.e., $w \equiv 1$. Also, if $x$ and $x^+$ are the current and the next states of the Markov control process, respectively, then Assumption 4.1(ii) implies that
$$\sup_{a \in A(x)} \mathsf{E}\bigl[w(x^+)\mathbf{1}_{\{x^+ \in X \setminus K\}} \bigm| (x, a)\bigr] \le \beta w(x) \qquad \forall\, x \in X \setminus K.$$
We observe that this bears a resemblance to classical Lyapunov-like stability criteria, more specifically, the Foster-Lyapunov conditions [27, Chapter 8], [17]. However, the condition in Assumption 4.1(ii) is uniform over the set of actions $A(x)$ pointwise in $x$. It connects the growth of the cost-per-stage function $c$ with a contraction induced by the discount factor $\alpha$. ◊

Recall that a mapping $f : Y \to Y$ on a nonempty complete metric space $(Y, \rho)$ is a contraction if there exists a constant $\gamma \in [0, 1[$ such that $\rho(f(x_1), f(x_2)) \le \gamma \rho(x_1, x_2)$ for all $x_1, x_2 \in Y$. The constant $\gamma$ is said to be the modulus of the map $f$. A contraction has a unique fixed point $x^* \in Y$ satisfying $f(x^*) = x^*$.

4.3. Proposition ([20, Proposition 7.2.9]). Let $T$ be a monotone map from the Banach space $(L_0(X \setminus K), \|\cdot\|_w)$ into itself. If there exists a $\gamma \in [0, 1[$ such that
$$T(u + rw) \le T(u) + \gamma r w \qquad \text{whenever } u \in (L_0(X \setminus K), \|\cdot\|_w),\ r \ge 0, \tag{4.4}$$
then $T$ is a contraction with modulus $\gamma$.

We have the following lemma. 

4.5. Lemma. Under Assumption 4.1, the map $T$ in (3.15) is a contraction on $(L_0(X \setminus K)^+, \|\cdot\|_w)$ with modulus $\gamma = \alpha\beta < 1$.

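Proof. Fix $u \in L_0(X \setminus K)^+$ with $\|u\|_w < \infty$. As in the proof of Lemma 3.16, the mapping
$$\mathbb{K} \ni (x, a) \longmapsto T'u(x, a) = c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, u(y)$$
is well-defined and l.s.c. in $a \in A(x)$ for all $x \in X \setminus K$. By the same lemma we also know that $T$ maps $L_0(X \setminus K)^+$ into $L_0(X \setminus K)^+$. For every $(x, a) \in \mathbb{K}$, by Assumption 4.1,
$$|T'u(x, a)| \le c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, \frac{|u(y)|}{w(y)}\, w(y) \le \bar c\, w(x) + \alpha \|u\|_w\, \beta w(x) = \bigl(\bar c + \alpha\beta \|u\|_w\bigr) w(x),$$
which shows that $\|Tu\|_w \le \bar c + \alpha\beta\|u\|_w$. Therefore, $T$ maps $(L_0(X \setminus K)^+, \|\cdot\|_w)$ into itself.

Since each $Q(\cdot|x, a)$ is a nonnegative measure, it is clear that $T$ is a monotone map on $(L_0(X \setminus K)^+, \|\cdot\|_w)$. By Assumption 4.1(ii), for $r \ge 0$ and $x \in X \setminus K$ we have
$$\begin{aligned} T(u + rw)(x) &= \min_{a \in A(x)} \Biggl[c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\bigl(u(y) + r w(y)\bigr)\Biggr] \\ &\le \min_{a \in A(x)} \Biggl[c(x, a) + \alpha \int_{X \setminus K} Q(dy|x, a)\, u(y)\Biggr] + r\alpha\beta w(x) \\ &= Tu(x) + r\alpha\beta w(x). \end{aligned}$$
This shows that (4.4) holds with $\gamma = \alpha\beta$, and Proposition 4.3 implies that $T$ is a contraction on $(L_0(X \setminus K)^+, \|\cdot\|_w)$. □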
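The contraction property of Lemma 4.5 can be observed numerically. The Python sketch below uses a hypothetical four-state model with $K = \{0\}$ and the hypothetical weight $w(x) = 1 + x$ (function `weight`); all numbers are illustrative. It computes the smallest admissible $\beta \ge 1$ in Assumption 4.1(ii) and then checks $\|Tu - Tv\|_w \le \alpha\beta\,\|u - v\|_w$ on randomly drawn nonnegative $u$, $v$. Since the model is finite, this only illustrates the inequality and is no substitute for the proof.

```python
import random

# Hypothetical model and weight: all numbers are illustrative, not taken from the article.
ALPHA = 0.9
STATES = [0, 1, 2, 3]
K = {0}
ACTIONS = [0, 1]
Q = {
    (1, 0): [0.5, 0.3, 0.2, 0.0],
    (1, 1): [0.7, 0.1, 0.1, 0.1],
    (2, 0): [0.2, 0.4, 0.3, 0.1],
    (2, 1): [0.4, 0.2, 0.2, 0.2],
    (3, 0): [0.1, 0.2, 0.3, 0.4],
    (3, 1): [0.6, 0.1, 0.1, 0.2],
}
OUTSIDE = [x for x in STATES if x not in K]


def cost(x, a):
    """Hypothetical cost-per-stage: action 1 costs more."""
    return 1.0 + 0.5 * a


def weight(x):
    """Hypothetical weight function w(x) = 1 + x, so w >= 1 on the states outside K."""
    return 1.0 + x


def bellman(u, alpha=ALPHA):
    """(Tu)(x) = min_a [c(x, a) + alpha * sum_{y not in K} Q(y | x, a) u(y)] for x outside K."""
    return {x: min(cost(x, a) + alpha * sum(Q[(x, a)][y] * u[y] for y in OUTSIDE)
                   for a in ACTIONS)
            for x in OUTSIDE}


def weighted_norm(u):
    """||u||_w = max over x outside K of |u(x)| / w(x)."""
    return max(abs(u[x]) / weight(x) for x in OUTSIDE)


def beta_constant():
    """Smallest beta >= 1 satisfying Assumption 4.1(ii) for this finite model."""
    ratio = max(sum(Q[(x, a)][y] * weight(y) for y in OUTSIDE) / weight(x)
                for x in OUTSIDE for a in ACTIONS)
    return max(1.0, ratio)


def worst_observed_ratio(n_trials=500, seed=1, alpha=ALPHA):
    """Largest ||Tu - Tv||_w / ||u - v||_w over random nonnegative pairs (u, v)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_trials):
        u = {x: rng.uniform(0.0, 10.0) for x in OUTSIDE}
        v = {x: rng.uniform(0.0, 10.0) for x in OUTSIDE}
        tu, tv = bellman(u, alpha), bellman(v, alpha)
        num = weighted_norm({x: tu[x] - tv[x] for x in OUTSIDE})
        den = weighted_norm({x: u[x] - v[x] for x in OUTSIDE})
        if den > 0.0:
            worst = max(worst, num / den)
    return worst


if __name__ == "__main__":
    gamma = ALPHA * beta_constant()
    print("gamma =", gamma, "worst observed ratio =", worst_observed_ratio())
```

In this model the raw ratio in Assumption 4.1(ii) stays below 1, so $\beta = 1$ and $\gamma = \alpha$. The observed ratios are strictly smaller, because part of the mass leaks into $K$ at every step.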

The following proposition establishes bounds for the distance between the optimal value function $V^*$ and the $\alpha$-VI functions $(v_n)_{n \in \mathbb{N}}$ by employing the contraction mapping $T$ of Lemma 4.5.

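4.6. Proposition. Suppose that Assumption 4.1 holds, and let $\gamma := \alpha\beta$. Then:

(i) The $\alpha$-discount value function $V^*$ satisfies $\|V^*\|_w \le \bar c/(1-\gamma)$.

(ii) The $\alpha$-VI functions $(v_n)_{n \in \mathbb{N}}$ satisfy
$$V^*(x) - v_n(x) \le \bar c\, w(x)\, \frac{\gamma^n}{1-\gamma} \qquad \forall\, x \in X \setminus K,\ \forall\, n \in \mathbb{N}.$$
In particular, $\|v_n - V^*\|_w \le \bar c\,\gamma^n/(1-\gamma)$ for all $n \in \mathbb{N}_0$.

(iii) The optimal value function $V^*$ is the unique function in $(L_0(X \setminus K)^+, \|\cdot\|_w)$ that solves the $\alpha$-DCOE (3.8).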

Proof (i) Let n be an arbitrary Markov policy. Trivially we have [w(xo)] ^ w(x). Fix 
i e N, and a history h, e iJiAx- In view of Assumption 4.1(ii), on the event {i < t} we have 

- 

Q(dy|Xi_i,a;_i)w(y) ^ /3w(x,_i) Va, eA(x,), 

which shows that [w(x;)1j;<t.j] ^ /3E^ [w(x;_i)1j,<t-j] . Iterating this inequality we arrive 
at E" [w(x,)lj;<^j] ^ ^'w(x). Also, by Assumption 4.1(i) we have c(x;,a;) ^ cw(x;) for all 
I e Nq such that i < t, which in conjunction with the above inequality gives 

(4.7) E^[c(x„a,)l5,<,}] ^ c/3'w(x). 

By the monotone convergence theorem and (4.7) we have 



E;j[w(x,)|h,_i,a,_i] 



^a'c(x„a,)lj,<^j 



i=0 



y(7r,x)= E 

(4.8) 

^c^(a/3)'w(x)^w(x)-^ 



^^a'E;^ [c(x„a,)l,,<,;] 



i=0 

C 



i=0 



It follows immediately that ||V*||„ = ||infn V(n;,x)||^ ^ c/(l - j). 

(ii) By definition, the $\alpha$-VI functions $(v_n)_{n\in\mathbb N}$ satisfy $v_n = Tv_{n-1} = T^n v_0$, with $v_0 := 0$. Since $T$ is a contraction on $L_w(X\setminus K)^+$ by Lemma 4.5, it follows that $T$ has a unique fixed point, which, by definition, is $V^*$, since $\|V^*\|_w < \infty$ by (i). A standard property of contraction maps implies that

$$\|T^n u - V^*\|_w \le \gamma^n\,\|u - V^*\|_w \qquad \forall\, u\in L_w(X\setminus K)^+ \text{ with } \|u\|_w < \infty,\ \forall\, n\in\mathbb N_0.$$

Taking $u = v_0 = 0$ and using the bound on $\|V^*\|_w$ obtained in (i), we get $\|v_n - V^*\|_w \le \bar c\,\gamma^n/(1-\gamma)$. Since, moreover, $v_n \uparrow V^*$, the difference $V^* - v_n$ is nonnegative, and the last inequality yields $V^*(x) - v_n(x) \le \bar c\, w(x)\,\gamma^n/(1-\gamma)$ for every $x\in X\setminus K$.

(iii) Of course $V^*$ solves the $\alpha$-DCOE (3.8). Uniqueness follows from the facts that the operator $T$ in (3.15) is a contraction by Lemma 4.5, and that the fixed point of a contraction mapping on a Banach space (or, more generally, on a complete metric space) is unique. □
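The bounds of Proposition 4.6 can be checked on a small hypothetical finite model (all data invented for illustration). The sketch below runs $\alpha$-value iteration from $v_0 = 0$ and uses a long run as a proxy for $V^*$; because $v_n \uparrow V^*$, the proxy under-estimates $V^*$, so any inequality of the form $V^* - v_n \le \text{bound}$ still holds for it. The constant $\bar c$ is taken as the smallest value with $c(x,a) \le \bar c\, w(x)$.

```python
# Hypothetical finite model (illustration only): states 0, 1, 2 form X \ K,
# state 3 is the target K; reaching K stops the cost accumulation.
ALPHA = 0.9
OUT = [0, 1, 2]
ACTIONS = [0, 1]
Q = {
    0: {0: [0.6, 0.3, 0.1, 0.0], 1: [0.2, 0.5, 0.2, 0.1]},
    1: {0: [0.3, 0.4, 0.2, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.1, 0.2, 0.4, 0.3], 1: [0.0, 0.1, 0.3, 0.6]},
}
C = {0: [4.0, 6.0], 1: [2.0, 3.0], 2: [1.0, 2.5]}
W = [2.0, 1.5, 1.0]

BETA = max(sum(Q[x][a][y] * W[y] for y in OUT) / W[x] for x in OUT for a in ACTIONS)
GAMMA = ALPHA * BETA                                         # gamma = alpha * beta
C_BAR = max(C[x][a] / W[x] for x in OUT for a in ACTIONS)   # c <= C_BAR * w


def bellman(u):
    """One application of the operator T restricted to X \\ K."""
    return [min(C[x][a] + ALPHA * sum(Q[x][a][y] * u[y] for y in OUT)
                for a in ACTIONS) for x in OUT]


def value_iteration(n):
    """Return [v_0, ..., v_n] with v_0 = 0 and v_k = T v_{k-1}."""
    vs = [[0.0 for _ in OUT]]
    for _ in range(n):
        vs.append(bellman(vs[-1]))
    return vs


vs = value_iteration(3000)
v_star = vs[-1]  # proxy for V* (from below)

# Proposition 4.6(i): ||V*||_w <= C_BAR / (1 - GAMMA)
norm_v_star = max(v_star[x] / W[x] for x in OUT)
# Proposition 4.6(ii): V*(x) - v_n(x) <= C_BAR * w(x) * GAMMA**n / (1 - GAMMA)
slack = min(C_BAR * W[x] * GAMMA**n / (1 - GAMMA) - (v_star[x] - vs[n][x])
            for n in range(60) for x in OUT)
print(norm_v_star, C_BAR / (1 - GAMMA), slack)
```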

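Note that the conditions in Assumption 4.1 are automatically satisfied if $c$ is bounded: one can take $w \equiv 1$, $\bar c := \sup_{\mathbb K} c(x,a)$ and $\beta = 1$, since $Q(X\setminus K\mid x,a) \le 1$, so that $\gamma = \alpha$. This gives the following straightforward result.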

4.9. Corollary. Suppose that Assumption 3.6 holds and $\bar c := \sup_{\mathbb K} c(x,a) < \infty$. Then:
(i) The $\alpha$-discount value function $V^*$ satisfies $\|V^*\|_\infty \le \bar c/(1-\alpha)$.
(ii) The $\alpha$-VI functions $(v_n)_{n\in\mathbb N}$ satisfy

$$V^*(x) - v_n(x) \le \bar c\,\frac{\alpha^n}{1-\alpha} \qquad \forall\, x\in X\setminus K,\ \forall\, n\in\mathbb N.$$

In particular, $\|v_n - V^*\|_\infty \le \bar c\,\alpha^n/(1-\alpha)$ for all $n\in\mathbb N_0$.
(iii) The optimal value function $V^*$ is the unique function in $L_\infty(X\setminus K)^+$ that solves the $\alpha$-DCOE (3.8).

§5. Asymptotic Discount Optimality of the $\alpha$-VI Policy

We have seen that the $\alpha$-value iteration functions $(v_n)_{n\in\mathbb N}$ defined in (3.10) converge to $V^*$ by Lemma 3.20. In this section we address the question whether the $\alpha$-VI policies converge in some sense to a policy $f^\infty$ as $n\to\infty$.

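5.1. Definition. Let $(v_n)_{n\in\mathbb N_0}$ be the sequence of $\alpha$-VI functions in (3.10), and let $\pi = (f_n)_{n\in\mathbb N_0}\in\Pi_{\mathrm{DM}}$ be a deterministic Markov policy such that $f_0\in\mathbb F$ is arbitrary and, for $n\in\mathbb N$,

$$v_n(x) = c(x,f_n) + \alpha\int_{X\setminus K} Q(dy\mid x,f_n)\,v_{n-1}(y) \qquad \forall\, x\in X\setminus K.$$

Then $\pi$ is called an $\alpha$-VI policy. ◊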
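For a finite model, an $\alpha$-VI policy in the sense of Definition 5.1 is obtained by recording, at each iteration, a minimizing action for every state. The sketch below does this on a hypothetical model (invented data); `q_value` and `vi_policy` are illustrative names, ties are broken by taking the first minimizing action, and $f_0$ is set arbitrarily to action 0.

```python
# Hypothetical finite model (illustration only): states 0, 1, 2 form X \ K, state 3 is K.
ALPHA = 0.9
OUT = [0, 1, 2]
ACTIONS = [0, 1]
Q = {
    0: {0: [0.6, 0.3, 0.1, 0.0], 1: [0.2, 0.5, 0.2, 0.1]},
    1: {0: [0.3, 0.4, 0.2, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.1, 0.2, 0.4, 0.3], 1: [0.0, 0.1, 0.3, 0.6]},
}
C = {0: [4.0, 6.0], 1: [2.0, 3.0], 2: [1.0, 2.5]}


def q_value(x, a, v):
    """c(x,a) + alpha * sum over y outside K of Q(y|x,a) v(y)."""
    return C[x][a] + ALPHA * sum(Q[x][a][y] * v[y] for y in OUT)


def vi_policy(n_max):
    """Return alpha-VI functions v_0..v_nmax and selectors f_0..f_nmax (f_0 arbitrary)."""
    vs = [[0.0 for _ in OUT]]
    fs = [{x: ACTIONS[0] for x in OUT}]  # f_0: an arbitrary selector
    for _ in range(n_max):
        prev = vs[-1]
        # f_n(x) minimizes c(x,a) + alpha * int Q(dy|x,a) v_{n-1}(y)
        f = {x: min(ACTIONS, key=lambda a: q_value(x, a, prev)) for x in OUT}
        fs.append(f)
        vs.append([q_value(x, f[x], prev) for x in OUT])
    return vs, fs


vs, fs = vi_policy(50)
for n in (1, 2, 5, 50):
    print(n, fs[n])
```

On a finite action set, every sequence $(f_n(x))$ trivially has an accumulation point (some action recurs infinitely often), which is the finite analogue of the selector provided by Proposition 5.2 below.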
Under Assumption 3.6 we get the following basic existential result. 

5.2. Proposition. Suppose that Assumption 3.6 holds and that the action space $A$ is locally compact, and let $\pi = (f_n)_{n\in\mathbb N_0}\in\Pi_{\mathrm{DM}}$ be an $\alpha$-VI policy as defined in Definition 5.1. Then there exists a selector $f\in\mathbb F$ such that, for every $x\in X\setminus K$, $f(x)\in A(x)$ is an accumulation point of $(f_n(x))_{n\in\mathbb N_0}$, and the corresponding deterministic stationary policy $f^\infty\in\Pi_{\mathrm{DS}}$ is $\alpha$-discount optimal.

The proof is based on the following immediate adaptation of [19, Lemma 4.6.6]. 

5.3. Lemma. Let $u$ and $u_n$, $n\in\mathbb N$, be l.s.c. functions, bounded below, and inf-compact on $\mathbb K$. For every $n\in\mathbb N$ let $u_n^*(x) := \min_{a\in A(x)} u_n(x,a)$ and $u^*(x) := \min_{a\in A(x)} u(x,a)$, and let $f_n\in\mathbb F$ be a selector such that $u_n^*(x) = u_n(x,f_n(x))$ for all $x\in X\setminus K$. If $A$ is locally compact and $u_n\uparrow u$, then there exists a selector $f\in\mathbb F$ such that $f(x)\in A(x)$ is an accumulation point of the sequence $(f_n(x))_{n\in\mathbb N}$ for every $x\in X\setminus K$, and $u^*(x) = u(x,f(x))$.

Proof of Proposition 5.2. For $(x,a)\in\mathbb K$ we define $u(x,a) := c(x,a) + \alpha\int_{X\setminus K} Q(dy\mid x,a)\,V^*(y)$, and

$$u_n(x,a) := c(x,a) + \alpha\int_{X\setminus K} Q(dy\mid x,a)\,v_{n-1}(y). \tag{5.4}$$

Since $c \ge 0$, the functions $u_n$ and $u$ are nonnegative. Since $v_n\uparrow V^*$ by Lemma 3.20, the monotone convergence theorem implies that

$$\int_{X\setminus K} Q(dy\mid x,a)\,v_n(y) \uparrow \int_{X\setminus K} Q(dy\mid x,a)\,V^*(y)$$

pointwise on $\mathbb K$. It is clear that $u_n\uparrow u$, and the assertion follows at once from Lemma 5.3. □

Under the stronger Assumption 4.1 we get quantitative estimates of the rate at which the $\alpha$-VI policy defined in Definition 5.1 converges to an optimal one.



5.5. Definition. The function $D:\mathbb K\to\mathbb R_{\ge 0}$ defined by

$$\mathbb K \ni (x,a) \longmapsto D(x,a) := c(x,a) + \alpha\int_{X\setminus K} Q(dy\mid x,a)\,V^*(y) - V^*(x)$$

is called the $\alpha$-discount discrepancy function. The $\alpha$-VI policy $\pi = (f_n)_{n\in\mathbb N_0}$ defined in Definition 5.1 is called pointwise asymptotically discount optimal if for every $x\in X\setminus K$ we have $\lim_{n\to\infty} D(x,f_n) = 0$.
It is clear that for a selector $f\in\mathbb F$ (see paragraph 3.5), the $\alpha$-discount discrepancy function satisfies $D(x,f(x)) = 0$ for every $x\in X\setminus K$ if and only if $f^\infty$ is an optimal policy. The function $D$ thus measures closeness to an optimal selector in a weak sense.
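The discrepancy function of Definition 5.5 is easy to tabulate once (a proxy for) $V^*$ is available. The following sketch, on an invented finite model, computes $D(x,a)$ for all pairs, and checks that $D \ge 0$ and that $D$ vanishes along a greedy selector, in line with the remark above. The value-iteration proxy for $V^*$ is an assumption of the sketch.

```python
# Hypothetical finite model (illustration only): states 0, 1, 2 form X \ K, state 3 is K.
ALPHA = 0.9
OUT = [0, 1, 2]
ACTIONS = [0, 1]
Q = {
    0: {0: [0.6, 0.3, 0.1, 0.0], 1: [0.2, 0.5, 0.2, 0.1]},
    1: {0: [0.3, 0.4, 0.2, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.1, 0.2, 0.4, 0.3], 1: [0.0, 0.1, 0.3, 0.6]},
}
C = {0: [4.0, 6.0], 1: [2.0, 3.0], 2: [1.0, 2.5]}


def q_value(x, a, v):
    """c(x,a) + alpha * sum over y outside K of Q(y|x,a) v(y)."""
    return C[x][a] + ALPHA * sum(Q[x][a][y] * v[y] for y in OUT)


v_star = [0.0 for _ in OUT]
for _ in range(3000):  # value iteration as a proxy for V*
    v_star = [min(q_value(x, a, v_star) for a in ACTIONS) for x in OUT]


def discrepancy(x, a):
    """D(x,a) = c(x,a) + alpha * int Q(dy|x,a) V*(y) - V*(x)."""
    return q_value(x, a, v_star) - v_star[x]


table = {(x, a): discrepancy(x, a) for x in OUT for a in ACTIONS}
greedy = {x: min(ACTIONS, key=lambda a: discrepancy(x, a)) for x in OUT}
print(table)
print("greedy selector:", greedy)
```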

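5.6. Proposition. Suppose that Assumption 4.1 holds, and let $\gamma := \alpha\beta$. Then the $\alpha$-VI policy $\pi = (f_n)_{n\in\mathbb N_0}$ is pointwise asymptotically discount optimal, and for every $x\in X\setminus K$ and $n\in\mathbb N$,

$$0 \le D(x,f_n) \le 2\bar c\,\frac{\gamma^n}{1-\gamma}\,w(x).$$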

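Proof. The first inequality follows directly from the definition of $V^*$, which solves the $\alpha$-DCOE (3.8). To prove the second inequality, fix $x\in X\setminus K$. By the definition of the discrepancy function and Definition 5.1,

$$\begin{aligned} D(x,f_n) &= c(x,f_n) + \alpha\int_{X\setminus K} Q(dy\mid x,f_n)\,V^*(y) - V^*(x) \\ &= \big(v_n(x) - V^*(x)\big) + \alpha\int_{X\setminus K} Q(dy\mid x,f_n)\big(V^*(y) - v_{n-1}(y)\big). \end{aligned} \tag{5.7}$$

By Proposition 4.6(ii) we have

$$|v_n(x) - V^*(x)| \le \bar c\, w(x)\,\frac{\gamma^n}{1-\gamma}, \tag{5.8}$$

and in the light of Assumption 4.1(ii) we arrive at

$$\alpha\int_{X\setminus K} Q(dy\mid x,f_n)\big(V^*(y) - v_{n-1}(y)\big) \le \alpha\,\bar c\,\frac{\gamma^{n-1}}{1-\gamma}\int_{X\setminus K} Q(dy\mid x,f_n)\,w(y) \le \bar c\,\frac{\gamma^n}{1-\gamma}\,w(x). \tag{5.9}$$

The assertion follows immediately after substituting (5.8) and (5.9) in (5.7); since $\gamma < 1$, the bound tends to $0$ as $n\to\infty$. □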
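The rate in Proposition 5.6 can likewise be probed numerically. Assuming a hypothetical weighted model (invented data satisfying Assumption 4.1 with the $\beta$ computed in the code), the sketch below builds the $\alpha$-VI policy of Definition 5.1, evaluates $D(x,f_n)$ with a value-iteration proxy for $V^*$, and compares it with $2\bar c\,\gamma^n w(x)/(1-\gamma)$.

```python
# Hypothetical weighted model (illustration only): states 0, 1, 2 form X \ K, state 3 is K.
ALPHA = 0.9
OUT = [0, 1, 2]
ACTIONS = [0, 1]
Q = {
    0: {0: [0.6, 0.3, 0.1, 0.0], 1: [0.2, 0.5, 0.2, 0.1]},
    1: {0: [0.3, 0.4, 0.2, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.1, 0.2, 0.4, 0.3], 1: [0.0, 0.1, 0.3, 0.6]},
}
C = {0: [4.0, 6.0], 1: [2.0, 3.0], 2: [1.0, 2.5]}
W = [2.0, 1.5, 1.0]
BETA = max(sum(Q[x][a][y] * W[y] for y in OUT) / W[x] for x in OUT for a in ACTIONS)
GAMMA = ALPHA * BETA
C_BAR = max(C[x][a] / W[x] for x in OUT for a in ACTIONS)


def q_value(x, a, v):
    """c(x,a) + alpha * sum over y outside K of Q(y|x,a) v(y)."""
    return C[x][a] + ALPHA * sum(Q[x][a][y] * v[y] for y in OUT)


vs = [[0.0 for _ in OUT]]
for _ in range(3000):  # alpha-VI functions v_0, v_1, ...
    vs.append([min(q_value(x, a, vs[-1]) for a in ACTIONS) for x in OUT])
v_star = vs[-1]        # proxy for V*

checks = []
for n in range(1, 60):
    # f_n from Definition 5.1: greedy with respect to v_{n-1}
    f_n = {x: min(ACTIONS, key=lambda a: q_value(x, a, vs[n - 1])) for x in OUT}
    for x in OUT:
        d = q_value(x, f_n[x], v_star) - v_star[x]          # D(x, f_n)
        bound = 2 * C_BAR * GAMMA**n * W[x] / (1 - GAMMA)    # Proposition 5.6
        checks.append((d, bound))
print(max(d for d, _ in checks[-3:]))
```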

For bounded costs we have the following straightforward conclusion. 

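5.10. Corollary. Suppose that Assumption 3.6 holds and $\bar c := \sup_{\mathbb K} c(x,a) < \infty$. Then the $\alpha$-VI policy $\pi = (f_n)_{n\in\mathbb N_0}$ is pointwise asymptotically discount optimal, and for every $x\in X\setminus K$ and $n\in\mathbb N$,

$$0 \le D(x,f_n) \le 2\bar c\,\frac{\alpha^n}{1-\alpha}.$$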

§6. Average Cost of Recovery

As mentioned in §1, a motivation for this work was to come up with a suitable recovery strategy for MPC. Tracing our development of the MPC methodology in §1, one sees that in the presence of state and/or action constraints, one seeks a deterministic stationary policy that is active whenever the state is inside the safe set $K$, and a recovery strategy outside $K$. Let us assume that for a given problem we have determined such a policy, say $g^\infty$, and that we have also determined a deterministic stationary policy $f_r^\infty$ corresponding to the recovery strategy, for a cost-per-stage function defined on $X\setminus K$, for the same problem as described in the preceding sections. A natural question at this stage is whether one can estimate the average cost of recovery.

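To this end let us define two constants:

$$\rho_1 := \inf_{x\in K}\int_{X\setminus K} Q(dy\mid x,g(x))\,V^*(y), \qquad \rho_2 := \sup_{x\in K}\int_{X\setminus K} Q(dy\mid x,g(x))\,V^*(y), \tag{6.1}$$

where $V^*$ is as defined in (3.1). Let $f^\infty$ be the deterministic stationary policy defined by

$$f(x) := f_r(x)\,\mathbf 1_{X\setminus K}(x) + g(x)\,\mathbf 1_K(x); \tag{6.2}$$

to wit, $f^\infty$ is the concatenation of $f_r^\infty$ and $g^\infty$ between exit and entry times to $K$. We have the following result: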
6.3. Proposition. Let $g^\infty$ be a deterministic stationary policy that is active whenever the state is inside the set $K$, and let $f_r^\infty$ be a recovery strategy corresponding to the problem (3.1). Let the initial condition $x$ be in $X\setminus K$. We define the average cost of recovery

$$W(x) := \lim_{n\to\infty}\frac{1}{n+1}\,\mathsf E^{f^\infty}_x\Big[\sum_{i=0}^{n}\ \sum_{t=\tau_{2i}}^{\tau_{2i+1}-1}\alpha^{t-\tau_{2i}}\,c(x_t,a_t)\Big],$$

where $f^\infty$ is as defined in (6.2), $\tau_0 := 0$, $\tau_1$ is the first entry time to $K$, $\tau_2$ is the first exit time from $K$ after $\tau_1$, and so on. Suppose that from any initial condition in $X\setminus K$ the first hitting time of $K$ is finite almost surely under $f_r^\infty$, and from any initial condition in $K$ the first hitting time of $X\setminus K$ is finite almost surely under $g^\infty$. Then we have $\rho_1 \le W(x) \le \rho_2$, where $\rho_1,\rho_2$ are as defined in (6.1).

Note that an identical bound holds if the initial condition $x\in K$, with an obvious relabelling of the stopping times $(\tau_i)_{i\in\mathbb N_0}$.

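Proof. First of all, note that the policy $f^\infty$ is deterministic stationary, and under this policy the controlled process is a time-homogeneous Markov process. Now we have, for a fixed $n\in\mathbb N$,

$$\begin{aligned} \mathsf E^{f^\infty}_x\Big[\sum_{i=0}^{n}\sum_{t=\tau_{2i}}^{\tau_{2i+1}-1}\alpha^{t-\tau_{2i}}c(x_t,a_t)\Big] &= \sum_{i=0}^{n}\mathsf E^{f^\infty}_x\Big[\mathsf E^{f^\infty}\Big[\sum_{t=\tau_{2i}}^{\tau_{2i+1}-1}\alpha^{t-\tau_{2i}}c(x_t,a_t)\,\Big|\,\mathfrak F_{\tau_{2i}}\Big]\Big] \\ &= \sum_{i=0}^{n}\mathsf E^{f^\infty}_x\big[V^*(x_{\tau_{2i}})\big], \end{aligned} \tag{6.4}$$

where the first equality follows from monotone convergence and the last equality from the strong Markov property. Appealing to the strong Markov property once again we see that $\mathsf E^{f^\infty}[V^*(x_{\tau_{2i}})\mid\mathfrak F_{\tau_{2i}-1}] = \mathsf E^{f^\infty}[V^*(x_{\tau_{2i}})\mid x_{\tau_{2i}-1}]$. Finally, from the definition of $\rho_2$ it follows that, for $i\ge 1$,

$$\mathsf E^{f^\infty}\big[V^*(x_{\tau_{2i}})\mid x_{\tau_{2i}-1}\big] \le \sup_{\xi\in K}\int_{X\setminus K} Q(dy\mid\xi,g(\xi))\,V^*(y) = \rho_2.$$

It is not difficult to arrive at the lower bound $\mathsf E^{f^\infty}[V^*(x_{\tau_{2i}})\mid x_{\tau_{2i}-1}] \ge \rho_1$ by following the same steps as above. Substituting in (6.4), dividing by $n+1$ and taking limits (the term $i=0$, which equals $V^*(x)$, does not contribute to the limit), we arrive at the assertion. □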
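The constants in (6.1) only require $V^*$ on $X\setminus K$ and the one-step kernel under the in-$K$ policy $g$. The sketch below evaluates them, exactly as written in (6.1), for a hypothetical model with two states outside $K$ and two inside; the transition data, the costs and the policy `G` are invented for illustration, and $V^*$ is approximated by value iteration.

```python
# Hypothetical model (illustration only): states 0, 1 form X \ K, states 2, 3 form K.
ALPHA = 0.9
OUT = [0, 1]
INSIDE = [2, 3]
ACTIONS = [0, 1]
Q = {  # Q[x][a] = transition probabilities to states 0, 1, 2, 3
    0: {0: [0.5, 0.3, 0.2, 0.0], 1: [0.2, 0.3, 0.4, 0.1]},
    1: {0: [0.2, 0.4, 0.3, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.05, 0.05, 0.6, 0.3], 1: [0.1, 0.0, 0.5, 0.4]},
    3: {0: [0.0, 0.1, 0.3, 0.6], 1: [0.02, 0.03, 0.25, 0.7]},
}
C = {0: [3.0, 5.0], 1: [1.0, 2.0]}  # cost-per-stage, only needed outside K
G = {2: 0, 3: 1}                    # hypothetical policy g used inside K

# V* on X \ K by value iteration (the cost stops accumulating on entering K).
v_star = [0.0 for _ in OUT]
for _ in range(2000):
    v_star = [min(C[x][a] + ALPHA * sum(Q[x][a][y] * v_star[y] for y in OUT)
                  for a in ACTIONS) for x in OUT]

# (6.1): integral of V* over X \ K against Q(dy | x, g(x)), for x in K.
exit_value = {x: sum(Q[x][G[x]][y] * v_star[y] for y in OUT) for x in INSIDE}
rho_1 = min(exit_value.values())
rho_2 = max(exit_value.values())
print(v_star, rho_1, rho_2)
```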



§7. A Rolling Horizon Implementation 

The rolling-horizon procedure can be briefly described as follows. Fix a horizon $N\in\mathbb N$ and set $n = 0$. Then

(a) we determine an optimal control policy, say $\pi^*_{n:n+N}$, for the $(N+1)$-period cost function starting from time $n$, given the (perfectly observed) initial condition $x_n$; standard arguments lead to a realization of this policy as a sequence of $(N+1)$ selectors $\{f_{n,n+N-j}\mid j = n, n+1, \ldots, n+N\}$;

(b) we apply the first selector of this sequence at time $n$, increase $n$ to $n+1$, and go back to step (a).
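Steps (a) and (b) can be sketched for a finite model as follows. This is a minimal illustration under assumptions: the model data are invented, the $(N+1)$-period problem in step (a) is solved by backward recursion, and only the first selector of its solution is applied before the horizon is shifted. The names `first_action` and `rolling_horizon_run` are hypothetical.

```python
import random

# Hypothetical finite model (illustration only): states 0, 1, 2 form X \ K, state 3 is K.
ALPHA = 0.9
OUT = [0, 1, 2]
TARGET = 3
ACTIONS = [0, 1]
Q = {
    0: {0: [0.6, 0.3, 0.1, 0.0], 1: [0.2, 0.5, 0.2, 0.1]},
    1: {0: [0.3, 0.4, 0.2, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.1, 0.2, 0.4, 0.3], 1: [0.0, 0.1, 0.3, 0.6]},
}
C = {0: [4.0, 6.0], 1: [2.0, 3.0], 2: [1.0, 2.5]}


def q_value(x, a, v):
    return C[x][a] + ALPHA * sum(Q[x][a][y] * v[y] for y in OUT)


def finite_horizon_values(n_stages):
    """Optimal n_stages-period cost-to-go (v_0 = 0), by backward recursion."""
    v = [0.0 for _ in OUT]
    for _ in range(n_stages):
        v = [min(q_value(x, a, v) for a in ACTIONS) for x in OUT]
    return v


def first_action(x, horizon):
    """Step (a): first selector of an optimal (horizon + 1)-period policy, at state x."""
    tail = finite_horizon_values(horizon)  # optimal cost of the remaining stages
    return min(ACTIONS, key=lambda a: q_value(x, a, tail))


def rolling_horizon_run(x0, horizon, rng, max_steps=10_000):
    """Re-plan at every time n, apply only the first action (step (b)), stop on entering K."""
    x, n, cost = x0, 0, 0.0
    while x != TARGET and n < max_steps:
        a = first_action(x, horizon)
        cost += ALPHA**n * C[x][a]
        x = rng.choices(range(len(Q[x][a])), weights=Q[x][a])[0]
        n += 1
    return cost, n


rng = random.Random(1)
runs = [rolling_horizon_run(0, horizon=3, rng=rng) for _ in range(200)]
print(sum(c for c, _ in runs) / len(runs), sum(n for _, n in runs) / len(runs))
```

Because the model is stationary, `first_action(x, N)` depends only on `x`, so the closed loop is a stationary policy; recomputing the plan at every step merely mirrors the procedure literally and could be cached.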

Accordingly, the $n$-th step of this procedure consists of minimizing the stopped $(N+1)$-period cost function starting at time $n$; namely, the objective is to find a control policy that attains

$$\inf_{\pi\in\Pi} V_{n,n+N}(\pi,x), \qquad V_{n,n+N}(\pi,x) := \mathsf E^\pi\Big[\sum_{i=n\wedge(\tau-1)}^{(n+N)\wedge(\tau-1)}\alpha^{\,i-n\wedge(\tau-1)}\,c(x_i,a_i)\,\Big|\,x_{n\wedge(\tau-1)} = x\Big], \tag{7.1}$$

for $x\in X\setminus K$. By stationarity and the Markovian nature of the control model, it is enough to consider the control problem of minimizing the cost for $n = 0$, i.e., the problem of minimizing $V_{0,N}(\pi,x)$ over $\pi\in\Pi$. The corresponding optimal policy is the one that attains the $(N+1)$-stage $\alpha$-VI function $v_{N+1}$ in (3.10). This particular policy is realized as a sequence of $(N+1)$ selectors $(f_N,\ldots,f_0)$. Thus, in the light of the above discussion, the rolling-horizon procedure yields the stationary suboptimal control policy $\tilde\pi := f_N^\infty$ for the original problem (3.1).
Let $V(f_N^\infty,x)$ be the value function corresponding to the deterministic stationary policy $f_N^\infty = (f_N,f_N,\ldots)$, $x\in X\setminus K$. Observe that $V(f_N^\infty,x) < \infty$, which follows from the more general estimate in (4.8). Our objective in this section is to give quantitative estimates of the extent of suboptimality of the rolling-horizon policy $\tilde\pi$, compared to the optimal policy $\pi^*$ that attains the infimum in (3.1). We shall follow the notation of §4 above.

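7.2. Theorem. Suppose that Assumption 4.1 holds, and let $\gamma := \alpha\beta$. For every $N\in\mathbb N_0$ and $x\in X\setminus K$ we have

$$0 \le V(f_N^\infty,x) - v_{N+1}(x) \le \bar c\, w(x)\,\frac{\gamma^{N+1}}{1-\gamma}, \tag{7.3}$$

where $v_{N+1}$ is the $(N+1)$-th $\alpha$-VI function defined in (3.10). In particular,

$$V(f_N^\infty,x) - V^*(x) \le \bar c\, w(x)\,\frac{\gamma^{N+1}}{1-\gamma}. \tag{7.4}$$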
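Theorem 7.2 can be checked on a hypothetical weighted finite model (invented data; $\beta$, $\gamma$ and $\bar c$ are computed in the code). The sketch takes $f_N$ to be a selector attaining $v_{N+1}$ (greedy with respect to $v_N$), evaluates $V(f_N^\infty,\cdot)$ by iterating its policy operator from zero, and compares the gap $V(f_N^\infty,x) - v_{N+1}(x)$ with the right-hand side of (7.3).

```python
# Hypothetical weighted model (illustration only): states 0, 1, 2 form X \ K, state 3 is K.
ALPHA = 0.9
OUT = [0, 1, 2]
ACTIONS = [0, 1]
Q = {
    0: {0: [0.6, 0.3, 0.1, 0.0], 1: [0.2, 0.5, 0.2, 0.1]},
    1: {0: [0.3, 0.4, 0.2, 0.1], 1: [0.1, 0.2, 0.4, 0.3]},
    2: {0: [0.1, 0.2, 0.4, 0.3], 1: [0.0, 0.1, 0.3, 0.6]},
}
C = {0: [4.0, 6.0], 1: [2.0, 3.0], 2: [1.0, 2.5]}
W = [2.0, 1.5, 1.0]
BETA = max(sum(Q[x][a][y] * W[y] for y in OUT) / W[x] for x in OUT for a in ACTIONS)
GAMMA = ALPHA * BETA
C_BAR = max(C[x][a] / W[x] for x in OUT for a in ACTIONS)


def q_value(x, a, v):
    """c(x,a) + alpha * sum over y outside K of Q(y|x,a) v(y)."""
    return C[x][a] + ALPHA * sum(Q[x][a][y] * v[y] for y in OUT)


def value_iteration(n):
    """alpha-VI functions v_0 = 0, ..., v_n."""
    vs = [[0.0 for _ in OUT]]
    for _ in range(n):
        vs.append([min(q_value(x, a, vs[-1]) for a in ACTIONS) for x in OUT])
    return vs


def evaluate(f, iters=3000):
    """V(f^inf, .) by iterating v <- c(., f) + alpha * Q(f) v, starting from v = 0."""
    v = [0.0 for _ in OUT]
    for _ in range(iters):
        v = [q_value(x, f[x], v) for x in OUT]
    return v


vs = value_iteration(20)
checks = []
for N in range(15):
    f_N = {x: min(ACTIONS, key=lambda a: q_value(x, a, vs[N])) for x in OUT}
    v_f = evaluate(f_N)
    for x in OUT:
        gap = v_f[x] - vs[N + 1][x]                               # V(f_N^inf, x) - v_{N+1}(x)
        bound = C_BAR * W[x] * GAMMA ** (N + 1) / (1 - GAMMA)      # right-hand side of (7.3)
        checks.append((gap, bound))
print(checks[:3])
```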



A proof of Theorem 7.2 is given in the Appendix; it follows the arguments in [3, Theorem 1] for finite state-space Markov decision processes and bounded costs. It is of interest to note that the bound in (7.3) is identical to the bound between $V^*(\cdot)$ and $v_{N+1}(\cdot)$ that appears in Proposition 4.6.

If the cost-per-stage function $c$ is bounded on $\mathbb K$, we have the following immediate corollary:

7.5. Corollary. Suppose the Markov control process satisfies Assumption 3.6. Let the cost-per-stage function $c:\mathbb K\to\mathbb R_{\ge 0}$ be bounded, with $\bar c := \sup_{\mathbb K} c(x,a) < \infty$. Then $V(f_N^\infty,x) \ge V^*(x)$ for every $x\in X\setminus K$, and

$$\sup_{x\in X\setminus K}\big(V(f_N^\infty,x) - V^*(x)\big) \le \bar c\,\frac{\alpha^{N+1}}{1-\alpha}.$$



§8. Application 

In this section we give a numerical example concerning fishery management. The example 
is motivated by [18, Chapter 7]. The example considers a fishery modeled in discrete-time 
with the time period representing a fishing season. The state of the controlled Markov chain 
is the population of the fish species of interest. Fishermen might on the one hand want to 
harvest all that they can manage in order to increase their short-run profit, but on the other 
hand this might lead to very low levels of the population. Our goal is to design a recovery 
strategy for the case that the population gets over-fished and goes below a critical level. 

For doing so, we consider a simple model with four possible fish population levels: 1 (almost extinct), 2, 3, and 4 (the target set). We assume that we can accurately measure the population size $X_k$ at the beginning of each season $k$. During a season the following actions are available: Harvest (1), Harvest less (2), Do nothing (3), Import fish (4), Import less (5). We also take as given the following transition probabilities between the Markov states, where $T^a(i,j)$ denotes the probability that the population level at the beginning of the next season will be $j$, given that the current population is $i$ and action $a$ is applied during this season.



10 
0.7 0.3 
0.1 0.6 0.3 



0.99 0.01 
0.01 0.7 0.28 0.01 
0.03 0.65 0.32 
0.4
0.45 0.54 
0.45 




0.01 
0.55 



The costs incurred at each state are $c(x_k,a_k) = C(x_k) + A(x_k,a_k)$, where

$$C(x_k) = \begin{bmatrix} 300 & 150 & 100 \end{bmatrix}^\top$$

represents a cost incurred for being at the current state, and

$$A(x_k,a_k) = \begin{bmatrix} -20 & -10 & 150 & 75 \\ -40 & -20 & 150 & 75 \\ -80 & -40 & 150 & 75 \end{bmatrix}$$

is the action cost associated with each action and state. We assume a discount factor $\alpha = 0.9$.

Using this setting, one can compute the policy that attains the $\alpha$-discount value function (3.1). This turns out to be: import fish when in state 1, import fewer fish in state 2, and do nothing in state 3. Next, we search for the optimal policy using a rolling-horizon control scheme, i.e., finding the policy that attains (7.1). We solve the problem for horizon lengths between 1 and 10, in order to compare the results with the infinite-horizon optimal policy.
[Figure 1. Accumulated cost average and standard deviation (panel: average accumulated cost; curves: infinite horizon, receding horizon).]

[Figure 2. Hitting time average and standard deviation (panels: average hitting time, standard deviation of hitting time; curves: infinite horizon, receding horizon).]

Figure 1 shows the average and the standard-deviation of the accumulated costs over 
2 X 10^ Monte Carlo runs, with the initial population level at state 1. Similarly, Figure 2 
shows the average and the standard-deviation of the time steps needed for the recovery into 






D. CHATTERJEE, E. CINQUEMANI, G. CHALOULOS, AND J. LYGEROS 



the target state 4. The results suggest that for the rolling-horizon policy to match the optimal infinite-horizon one, a horizon length of at least 8 should be used. Smaller horizons provide suboptimal policies (with respect to the infinite-horizon one), with the suboptimality gap decreasing as the horizon length increases. Note that the case $N = 1$ is not included in the data: for horizon length 1 the optimal policy is to harvest while the system is in state 1, which leads to infinite cost and recovery time, since the system never recovers to state 4.
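Curves such as those in Figures 1 and 2 come from Monte Carlo simulation of the closed loop. The sketch below shows one way to estimate the mean and standard deviation of the discounted accumulated cost and of the hitting time of the target for a stationary policy. The transition rows `P`, the costs `COST` and the two policies compared are hypothetical stand-ins, not the fishery data of this section.

```python
import random
import statistics

ALPHA = 0.9
TARGET = 3  # hypothetical target level (levels are indexed 0..3 here)
# Hypothetical transition rows P[x][a] over levels 0..3 (not the paper's T^a).
P = {
    0: {"import": [0.1, 0.6, 0.3, 0.0], "wait": [0.7, 0.3, 0.0, 0.0]},
    1: {"import": [0.0, 0.2, 0.6, 0.2], "wait": [0.1, 0.6, 0.3, 0.0]},
    2: {"import": [0.0, 0.0, 0.4, 0.6], "wait": [0.0, 0.1, 0.6, 0.3]},
}
COST = {  # hypothetical per-stage costs (state cost plus action cost)
    0: {"import": 450.0, "wait": 300.0},
    1: {"import": 300.0, "wait": 150.0},
    2: {"import": 250.0, "wait": 100.0},
}


def run_episode(policy, x0, rng, max_steps=100_000):
    """Simulate until the target is hit; return (discounted cost, hitting time)."""
    x, t, cost = x0, 0, 0.0
    while x != TARGET and t < max_steps:
        a = policy[x]
        cost += ALPHA**t * COST[x][a]
        x = rng.choices(range(4), weights=P[x][a])[0]
        t += 1
    return cost, t


def monte_carlo(policy, x0=0, runs=5000, seed=0):
    """Mean and standard deviation of accumulated cost and hitting time."""
    rng = random.Random(seed)
    samples = [run_episode(policy, x0, rng) for _ in range(runs)]
    costs = [c for c, _ in samples]
    times = [t for _, t in samples]
    return (statistics.mean(costs), statistics.stdev(costs),
            statistics.mean(times), statistics.stdev(times))


POLICIES = {"import": {x: "import" for x in P}, "wait": {x: "wait" for x in P}}
results = {name: monte_carlo(pol) for name, pol in POLICIES.items()}
for name, (mc, sc, mt, st) in results.items():
    print(f"{name}: cost {mc:.1f} +/- {sc:.1f}, hitting time {mt:.2f} +/- {st:.2f}")
```

For these invented rows the expected hitting times from level 0 are about 3.3 steps ("import") and 12.6 steps ("wait"), so the simulated means should land near those values.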



§9. Future Work 

We established in §3 that the optimal value function $V^*$ is the minimal solution of the $\alpha$-discounted cost optimality equation (3.8). However, obtaining an analytical expression for $V^*$ is difficult, particularly due to the integration over the subset $X\setminus K$ of the state space. Obtaining good approximations of $V^*$ is of vital importance, and will be reported in subsequent articles.

It is interesting to note that our basic framework of stochastic model-predictive control 
(described in §1) naturally leads to a partitioning of the state-space with different dynamics 
in each partition; thus, the controlled system may be viewed as a stochastic hybrid system. 
One of the basic questions in this context is that of stability of the controlled system, and in 
view of the fact that in general there will be infinitely many excursions of the state outside 
the safe set, establishing any stability property is a challenging task. Classical Lyapunov-based 
methods are difficult to apply directly precisely because of the infinitely many state-dependent 
switches between multiple regimes, each with different dynamics. However, the excursion theory
of Markov processes [9] enables us to establish certain stability properties of quite general
stochastic hybrid systems with state-dependent switching; some of these results are reported 
in [13]. 
Appendix A. Proof of Theorem 7.2



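Proof of Theorem 7.2. For brevity of notation in this proof, we let $\tilde\pi := f_N^\infty$, and let $\tilde\pi_{i:j}$ denote the (ordered) elements of the policy $\tilde\pi$ from stage $i$ through $j$ for $j \ge i$. The first inequality in (7.3) is trivial because $v_{N+1}(x) \le V^*(x) \le V(f_N^\infty,x)$ for all $x\in X\setminus K$. Before the proof of the second inequality in (7.3), let us fix some notation. Pick $N\in\mathbb N_0$. For $n\in\mathbb N_0$, a policy $\pi_{n:n+N}$ for stages $n$ through $n+N$, and $i\in\{n,\ldots,n+N\}$, let $\pi_{n:n+N}(i)$ denote the $i$-th element of the policy $\pi_{n:n+N}$. Also, let $Q(\cdot\mid x,\pi_{n:n+N})$ denote the sub-stochastic kernel¹ defined for $x\in X\setminus K$ by

$$Q(B\mid x,\pi_{n:n+N}) := \int_{X\setminus K}\!\!\cdots\int_{X\setminus K} Q\big(dy_1\mid x,\pi_{n:n+N}(n)\big)\,Q\big(dy_2\mid y_1,\pi_{n:n+N}(n+1)\big)\cdots Q\big(B\mid y_N,\pi_{n:n+N}(n+N)\big)$$

for $B\in\mathfrak B(X\setminus K)$, i.e., the $(N+1)$-step transition kernel of the process killed upon entering $K$. We use the convention $\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:-1})\,g(y) := g(x)$ for any function $g$. For $k\in\mathbb N_0$, $y\in X\setminus K$ and a policy $\pi$ we abbreviate $\mathsf E^\pi_{k,y}[\,\cdot\,] := \mathsf E^\pi[\,\cdot\mid x_{k\wedge(\tau-1)} = y]$, and for $k\le m$ we write $S_k^m := \sum_{i=k}^{m}\alpha^i c(x_i,a_i)\mathbf 1_{\{i<\tau\}}$.

Let $\pi^*_{n:n+N}$ be an optimal policy for stages $n$ through $n+N$, i.e., let $\pi^*_{n:n+N}$ attain the infimum in (7.1); by construction of the rolling-horizon policy, $\tilde\pi(n) = \pi^*_{n:n+N}(n)$. Fix $x'\in X\setminus K$. Let $\zeta_{n+1:n+N+1}$ be an $(N+1)$-period policy starting from stage $n+1$ whose first $N$ elements are identical to the last $N$ elements of $\pi^*_{n:n+N}$, i.e., $\zeta_{n+1:n+N+1}(j) = \pi^*_{n:n+N}(j)$ for $j = n+1,\ldots,n+N$. By optimality of $\pi^*_{n+1:n+N+1}$ we have, for every $y\in X\setminus K$,

$$\mathsf E^{\pi^*_{n+1:n+N+1}}_{n+1,y}\big[S_{n+1}^{n+N+1}\big] \le \mathsf E^{\zeta_{n+1:n+N+1}}_{n+1,y}\big[S_{n+1}^{n+N+1}\big].$$

Since $\zeta_{n+1:n+N+1}(j) = \pi^*_{n:n+N}(j)$ for $j = n+1,\ldots,n+N$ by construction, for $y\in X\setminus K$,

$$\mathsf E^{\zeta_{n+1:n+N+1}}_{n+1,y}\big[S_{n+1}^{n+N+1}\big] = \mathsf E^{\pi^*_{n:n+N}}_{n+1,y}\big[S_{n+1}^{n+N}\big] + \mathsf E^{\zeta_{n+1:n+N+1}}_{n+1,y}\big[\alpha^{n+N+1}c(x_{n+N+1},a_{n+N+1})\mathbf 1_{\{n+N+1<\tau\}}\big]. \tag{A.1}$$

Integrating (A.1) against $Q(dy\mid x',\pi^*_{n:n+N}(n))$ over $X\setminus K$ and noting that

$$\mathsf E^{\pi^*_{n:n+N}}_{n,x'}\big[S_n^{n+N}\big] = \mathsf E^{\pi^*_{n:n+N}}_{n,x'}\big[\alpha^n c(x_n,a_n)\mathbf 1_{\{n<\tau\}}\big] + \int_{X\setminus K} Q\big(dy\mid x',\pi^*_{n:n+N}(n)\big)\,\mathsf E^{\pi^*_{n:n+N}}_{n+1,y}\big[S_{n+1}^{n+N}\big],$$

we obtain

$$\begin{aligned} \int_{X\setminus K} Q\big(dy\mid x',\pi^*_{n:n+N}(n)\big)\,\mathsf E^{\pi^*_{n+1:n+N+1}}_{n+1,y}\big[S_{n+1}^{n+N+1}\big] &\le \mathsf E^{\pi^*_{n:n+N}}_{n,x'}\big[S_n^{n+N}\big] - \mathsf E^{\pi^*_{n:n+N}}_{n,x'}\big[\alpha^n c(x_n,a_n)\mathbf 1_{\{n<\tau\}}\big] \\ &\quad + \alpha^{n+N+1}\int_{X\setminus K} Q\big(dy\mid x',\pi^*_{n:n+N}(n)\big)\,\mathsf E^{\zeta_{n+1:n+N+1}}_{n+1,y}\big[c(x_{n+N+1},a_{n+N+1})\mathbf 1_{\{n+N+1<\tau\}}\big]. \end{aligned}$$

Let $\zeta_{n+1:n+N+1}(n+N+1)(\cdot)$ be a selector that attains the minimal value of the last term whenever $x'\in X\setminus K$, and let the corresponding minimal value be denoted by $e_n(x')$; clearly $e_n$ is well defined on $X\setminus K$ and is a measurable function of $x'$. With this notation, the last inequality becomes

$$\int_{X\setminus K} Q\big(dy\mid x',\pi^*_{n:n+N}(n)\big)\,\mathsf E^{\pi^*_{n+1:n+N+1}}_{n+1,y}\big[S_{n+1}^{n+N+1}\big] \le \mathsf E^{\pi^*_{n:n+N}}_{n,x'}\big[S_n^{n+N}\big] - \mathsf E^{\pi^*_{n:n+N}}_{n,x'}\big[\alpha^n c(x_n,a_n)\mathbf 1_{\{n<\tau\}}\big] + e_n(x')$$

whenever $x'\in X\setminus K$. Integrating this inequality with respect to $Q(dx'\mid x,\tilde\pi_{0:n-1})$ and using $\tilde\pi(n) = \pi^*_{n:n+N}(n)$, so that the composition of $Q(\cdot\mid x,\tilde\pi_{0:n-1})$ with $Q(\cdot\mid x',\pi^*_{n:n+N}(n))$ equals $Q(\cdot\mid x,\tilde\pi_{0:n})$, and rearranging, we get

$$\begin{aligned} \int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,\mathsf E^{\pi^*_{n:n+N}}_{n,y}\big[\alpha^n c(x_n,a_n)\mathbf 1_{\{n<\tau\}}\big] &\le \int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,\mathsf E^{\pi^*_{n:n+N}}_{n,y}\big[S_n^{n+N}\big] - \int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n})\,\mathsf E^{\pi^*_{n+1:n+N+1}}_{n+1,y}\big[S_{n+1}^{n+N+1}\big] \\ &\quad + \int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,e_n(y). \end{aligned}$$

Summing over $n$ we arrive at

$$\sum_{n=0}^{\infty}\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,\mathsf E^{\pi^*_{n:n+N}}_{n,y}\big[\alpha^n c(x_n,a_n)\mathbf 1_{\{n<\tau\}}\big] \le \sum_{n=0}^{\infty}\big(b_n - b_{n+1}\big) + \sum_{n=0}^{\infty}\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,e_n(y), \tag{A.2}$$

where $b_n := \int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,\mathsf E^{\pi^*_{n:n+N}}_{n,y}[S_n^{n+N}]$. Since $\tilde\pi(n) = \pi^*_{n:n+N}(n)$ for every $n$, the left-hand side of (A.2) is just $\mathsf E^{\tilde\pi}_x\big[\sum_{i=0}^{\tau-1}\alpha^i c(x_i,a_i)\big] = V(f_N^\infty,x)$. By Assumption 4.1(i) and (ii), arguing as in the proof of Proposition 4.6(i),

$$\mathsf E^{\pi^*_{n:n+N}}_{n,y}\big[S_n^{n+N}\big] \le \bar c\sum_{i=n}^{n+N}\alpha^i\beta^{i-n}\,w(y),$$

and since $c \ge 0$, each $b_n$ is nonnegative and at most

$$\bar c\sum_{i=n}^{n+N}\alpha^i\beta^{i-n}\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,w(y). \tag{A.3}$$

For a fixed $n\in\mathbb N_0$, the quantity $\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,w(y)$ is at most $\beta^n w(x)$ in view of Assumption 4.1(ii) and the definition of the kernel $Q(\cdot\mid x,\pi_{n:n+N})$ at the beginning of this proof. Therefore,

$$\sum_{n=0}^{\infty} b_n \le \bar c\sum_{n=0}^{\infty}\sum_{i=n}^{n+N}(\alpha\beta)^i\,w(x) = \bar c\,w(x)\sum_{n=0}^{\infty}\gamma^n\sum_{k=0}^{N}\gamma^k < \infty.$$

This shows that the series in (A.3) is summable; in particular $b_n\to 0$, and cancellations of the telescopic terms in the first series on the right-hand side of (A.2) are justified. Since $b_0 = \mathsf E^{\pi^*_{0:N}}_x\big[\sum_{i=0}^{N\wedge(\tau-1)}\alpha^i c(x_i,a_i)\big] = v_{N+1}(x)$, the inequality in (A.2) now simplifies to

$$V(f_N^\infty,x) \le v_{N+1}(x) + \sum_{n=0}^{\infty}\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,e_n(y). \tag{A.4}$$

By Assumption 4.1 and the definition of $e_n$, for every $x'\in X\setminus K$ (one transition from stage $n$ to stage $n+1$ followed by $N$ further transitions),

$$e_n(x') \le \alpha^{n+N+1}\int_{X\setminus K} Q\big(dy\mid x',\pi^*_{n:n+N}(n)\big)\,\mathsf E^{\zeta_{n+1:n+N+1}}_{n+1,y}\big[c(x_{n+N+1},a_{n+N+1})\mathbf 1_{\{n+N+1<\tau\}}\big] \le \bar c\,w(x')\,\alpha^{n+N+1}\beta^{N+1} = \bar c\,w(x')\,\alpha^n\gamma^{N+1}.$$

Substituting the last inequality in (A.4) and using once more $\int_{X\setminus K} Q(dy\mid x,\tilde\pi_{0:n-1})\,w(y) \le \beta^n w(x)$, we arrive at

$$V(f_N^\infty,x) \le v_{N+1}(x) + \bar c\,\gamma^{N+1}\sum_{n=0}^{\infty}\gamma^n\,w(x) = v_{N+1}(x) + \bar c\,\frac{\gamma^{N+1}}{1-\gamma}\,w(x),$$

which is the second bound in (7.3). The inequality (7.4) follows immediately from the fact that $V^* \ge v_n$ for every $n\in\mathbb N$. □

¹Recall that $Q(\cdot\mid\cdot)$ is a sub-stochastic kernel on $X\setminus K$ given $Y$ if $Q(B\mid\cdot)$ is a measurable function on $Y$ for each $B\in\mathfrak B(X\setminus K)$, and $Q(\cdot\mid y)$ is a measure on $X\setminus K$ with $Q(X\setminus K\mid y) \le 1$ for each $y\in Y$.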